DOCUMENT RESUME

TM 006 019

AUTHOR         Marshall, J. Laird
TITLE          The Mean Split-Half Coefficient of Agreement and Its
               Relation to Other Single-Administration Test Indices:
               A Study Based on Simulated Data. Technical Report No.
               350.
INSTITUTION    Wisconsin Univ., Madison. Research and Development
               Center for Cognitive Learning.
SPONS AGENCY   National Inst. of Education (DHEW), Washington, D.C.
PUB DATE       Jun 76
CONTRACT       NE-C-00-3-0065
NOTE           201p.

EDRS PRICE     MF-$0.83 HC-$11.37 Plus Postage.
DESCRIPTORS    Computer Programs; *Criterion Referenced Tests;
               Decision Making; Mathematical Models; Norm Referenced
               Tests; Simulation; Standard Error of Measurement;
               *Statistical Analysis; *Test Reliability; True Scores
IDENTIFIERS    *Coefficient Beta; Mean Split Half Coefficient of
               Agreement; Test Theory

ABSTRACT
A summary is provided of the rationale for questioning the
applicability of classical reliability measures to criterion
referenced tests; an extension of the classical theory of true and
error scores to incorporate a theory of dichotomous decisions; a
presentation of the mean split-half coefficient of agreement, a
single-administration test index designed to measure the internal
consistency of dichotomous classifications; and information
concerning the properties, under varying conditions, of this new
coefficient and several other single-administration test indices, as
well as their interrelationships. Simulated data were used to provide
answers to questions about the behavior of coefficient beta relative
to variations in score distribution, criterion level, number of
examinees, number of items, and certain basic test statistics. It was
determined that coefficient beta increases as the number of items
increases, but in a manner different from that predicted by the
Spearman-Brown prophecy formula. It was also shown that the value of
the coefficient increases as the bulk of scores departs from the
criterion cutoff. Relationships between coefficient beta and other
test indices are presented. Most prominent among these is the
indication that for unimodal score distributions, coefficient beta
and Livingston's criterion referenced reliability coefficient have
similar ranges of value and fluctuations over criterion level,
whereas this relationship does not hold for bimodal distributions,
since coefficient beta is sensitive to the mode(s) of the score
distribution while Livingston's coefficient is sensitive to the test
mean. (Author/RC)




Technical Report No. 350

THE MEAN SPLIT-HALF COEFFICIENT OF AGREEMENT
AND ITS RELATION TO OTHER SINGLE-ADMINISTRATION TEST INDICES:
A STUDY BASED ON SIMULATED DATA

by

J. Laird Marshall

Report from the Project on Conditions
of School Learning and Instructional Strategies

Thomas A. Romberg
Faculty Associate

Wisconsin Research and Development
Center for Cognitive Learning
The University of Wisconsin
Madison, Wisconsin

June 1976

Published by the Wisconsin Research and Development Center for Cognitive Learning,
supported in part as a research and development center by funds from the National
Institute of Education, Department of Health, Education, and Welfare. The opinions
expressed herein do not necessarily reflect the position or policy of the National
Institute of Education and no official endorsement by that agency should be inferred.

Center Contract No. NE-C-00-3-0065



WISCONSIN RESEARCH AND DEVELOPMENT
CENTER FOR COGNITIVE LEARNING

MISSION

The mission of the Wisconsin Research and Development Center
for Cognitive Learning is to help learners develop as rapidly
and effectively as possible their potential as human beings
and as contributing members of society. The R&D Center is
striving to fulfill this goal by

• conducting research to discover more about
  how children learn

• developing improved instructional strategies,
  processes and materials for school administrators,
  teachers, and children, and

• offering assistance to educators and citizens
  which will help transfer the outcomes of research
  and development into practice

PROGRAM

The activities of the Wisconsin R&D Center are organized
around one unifying theme, Individually Guided Education.

FUNDING

The Wisconsin R&D Center is supported with funds from the
National Institute of Education; the Bureau of Education for
the Handicapped, U.S. Office of Education; and the University
of Wisconsin.



ACKNOWLEDGMENTS

I would like to express my gratitude:

• to the Wisconsin Research and Development Center for Cognitive
  Learning, for providing me with working space, computer time,
  printing costs, and graphic and editorial assistance;

• to my committee as a whole, for their requirement that I restrict
  my scope and delineate my plans, and individually

     to Robert L. Thorndike, Chairman, for his wisdom, insightful
     criticism, and unwillingness to let me get away with very
     much;

     to Ruth Z. Gold, for her warmth, support, and demands for
     clarity;

     to Jeremy Kilpatrick, for his patience, encouragement, support,
     and helpful editorial suggestions;

• to my close friend and colleague, Ed Haertel, for being, over a
  period of years, an excellent computer programmer, idea source, and
  late-night intellectual sounding board for important parts of this
  document;

• and to my best friend (and wife), Nancy Marshall, for being who she
  is; although we are both educational psychologists of sorts, my
  field is numbers and formulas, and hers is people and feelings; the
  fact that I am writing this is due in large measure to her having
  practiced her area of expertise on me.



TABLE OF CONTENTS

Acknowledgments
List of Tables
List of Figures
Abstract

I.   Introduction
        Behavioral Objectives, Individualized Instruction,
          and Mastery Learning
        Criterion-Referenced Tests
        Overview

II.  Related Test Theory
        Purpose of a Test
        Score Distributions
        Test Specifications and Item Selection
        The Mathematical Model and Errors of Measurement
        Meaning of Reliability

III. Coefficient Beta: The Mean Split-Half Coefficient
     of Agreement
        History and Rationale
        Definitions
        Analysis of the Coefficient
        The Coefficient
        Adjustment for Odd n
        Technical Characteristics of Coefficient Beta
        Discussion
        Coefficient Beta and Trichotomous Data

IV.  Other Single-Administration Coefficients
        Livingston's Criterion-Referenced Reliability
          Coefficient
        Harris's Index of Efficiency
        The Index of Separation
        Other Fourfold Table Test Indices

V.   Focus of the Study, Data Generation, and Analytical
     Method
        Focus of the Study
        The Computer Program
        The Questions and Research Methods

VI.  Results and Conclusions
        Characteristics of Coefficient Beta
        Characteristics of Livingston's k²
        Characteristics of Harris's μ_c²
        Characteristics of S_c
        Relations Among Criterion-Dependent Indices

VII. Summary and Suggestions for Future Research
        Summary
        Suggestions for Further Research

References

Appendix A: Supplementary Algebraic Derivations
Appendix B: Graphs of φ(X) for each Score X, for
            Selected Criterion Levels and Number of Items
Appendix C: Computer Program Input Parameter Distributions
            and Subroutines, with Notes on Calculation of
            Vector Components
Appendix D: Summaries of Stepwise Analyses of Regression
Appendix E: A Binomial Model for Stepping Up Coefficient
            Beta



LIST OF TABLES

Table

1   Errors of Measurement under Two True-Score Models
2   Schema for Dual True-Score Model
3   A Fourfold Table for True and Observed Classifications
4   Input Parameters Used for the Study
5   Values of φ(X) for n = 20, c = ...
6   Ordinal Rank of Each Distribution on the Variable
    Indicated at Top of Column, 1 = Low, 8 = High
7   Values of Spearman's Rho (Rank-Order Correlation)
    Between min β and Basic Test Statistics
8   Values of Spearman's Rho (Rank-Order Correlation)
    Between max (μ_c²), μ_c², and Basic Test Statistics

9   Extreme Fluctuations in r_comp



LIST OF FIGURES

Figure

1       φ(X) for a 20-Item Test for Two Criterion Levels
2       Two Hypothetical Score Distributions
3       A Cathode-Ray Tube Analogy for the Computer Program
4       Relationships Between Item, Examinee, and Test
        Characteristics: A Comparison of the Classical (A) and
        Computer (B) Models
5       Histogram of Components of a Normally Distributed
        Competence Vector (c_p)
6       Histogram of Components of a Bimodal Competence
        Vector (c_p)
7       Score Distribution Resulting from Parameter Set 1
8       Score Distribution Resulting from Parameter Set 2
9       Score Distribution Resulting from Parameter Set 3
10      Score Distribution Resulting from Parameter Set 4
11      Score Distribution Resulting from Parameter Set 5
12      Score Distribution Resulting from Parameter Set 6
13      Score Distribution Resulting from Parameter Set 7
14      Score Distribution Resulting from Parameter Set 8
15-18   Graphs of Coefficient Beta against Criterion Level,
        with Score Distribution Relative Frequencies, for
        Parameter Sets 1-4
19-22   Graphs of Coefficient Beta against Criterion Level,
        with Score Distribution Relative Frequencies, for
        Parameter Sets 5-8
23      Scatterplot of β for 2N (or 4N) Examinees against
        β for N Examinees
24      Scatterplot of β (and α) for 2n Items against β (and α)
        for n Items
25      Scatterplot of β for 2n Items against β for n Items
        for a Normal Distribution
26      Scatterplot of β for 2n Items against β for n Items
        for a Uniform Distribution
27      Scatterplot of β for 2n Items against β for n Items
        for a Bimodal Distribution
28-31   Graphs of k² against Criterion Level, with Score
        Distribution Relative Frequencies, for Parameter
        Sets 1-4
32-35   Graphs of k² against Criterion Level, with Score
        Distribution Relative Frequencies, for Parameter
        Sets 5-8
36      Scatterplot of k² for 2N (or 4N) Examinees against
        k² for N Examinees
37      Scatterplot of k² for 2n Items against k² for
        n Items
38-41   Graphs of μ_c² against Percent Mastery, for Parameter
        Sets 1-4
42-45   Graphs of μ_c² against Percent Mastery, for Parameter
        Sets 5-8
46      Scatterplot of μ_c² for 2N (or 4N) Examinees against
        μ_c² for N Examinees
47      Scatterplot of μ_c² for 2n Items against μ_c² for n Items
48-51   Graphs of S_c against Criterion Level, with Score
        Distribution Relative Frequencies, for Parameter
        Sets 1-4
52-55   Graphs of S_c against Criterion Level, with Score
        Distribution Relative Frequencies, for Parameter
        Sets 5-8
56      Scatterplot of S_c for 2N (or 4N) Examinees against
        S_c for N Examinees
57      Scatterplot of S_c for 2n Items against S_c for n Items
58      Indices vs. Criterion Level: Parameter Set 1
59      Indices vs. Criterion Level: Parameter Set 2
60      Indices vs. Criterion Level: Parameter Set 3
61      Indices vs. Criterion Level: Parameter Set 4
62      Indices vs. Criterion Level: Parameter Set 5
63      Indices vs. Criterion Level: Parameter Set 6
64      Indices vs. Criterion Level: Parameter Set 7
65      Indices vs. Criterion Level: Parameter Set 8



ABSTRACT

This report provides a summary of the rationale for questioning
the applicability of classical reliability measures to criterion-
referenced tests; an extension of the classical theory of true and
error scores to incorporate a theory of dichotomous decisions; a
presentation of the mean split-half coefficient of agreement, a
single-administration test index designed to measure the internal
consistency of dichotomous classifications; and information concerning
the properties, under varying conditions, of this new coefficient and
several other single-administration test indices, as well as their
interrelationships.

Simulated data were used to provide answers to questions about the
behavior of coefficient beta relative to variations in score
distribution, criterion level, number of examinees, number of items,
and certain basic test statistics. It was determined that coefficient
beta increases as the number of items increases, but in a manner
different from that predicted by the Spearman-Brown prophecy formula.
It was also shown that the value of the coefficient increases as the
bulk of scores departs from the criterion cutoff.

Relationships between coefficient beta and other test indices are
presented. Most prominent among these is the indication that for
unimodal score distributions, coefficient beta and Livingston's k²
have similar ranges of value and fluctuations over criterion level,
whereas this relationship does not hold for bimodal distributions,
since coefficient beta is sensitive to the mode(s) of the score
distribution while k² is sensitive to the test mean.
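
For readers who want the comparison baseline in concrete form, the classical Spearman-Brown prophecy formula mentioned above can be sketched as follows. This is the standard textbook formula, not code from the study; the function name and argument names are illustrative assumptions.

```python
def spearman_brown(reliability, length_factor):
    """Classical Spearman-Brown prediction for a lengthened test.

    reliability   -- reliability of the original test
    length_factor -- ratio of new test length to original length
                     (2 means the number of items is doubled)
    Both names are illustrative; they do not come from the report.
    """
    k = length_factor
    return k * reliability / (1 + (k - 1) * reliability)


# Doubling a test whose reliability is 0.50
doubled = spearman_brown(0.50, 2)
```

The report's finding is that coefficient beta also rises with test length, but not along the curve this formula would predict.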






CHAPTER I
INTRODUCTION

Behavioral Objectives, Individualized Instruction, and Mastery Learning

In the past decade, educators have given an increasing amount of
attention to the related ideas of behavioral objectives, individualized
instruction, and mastery learning. These ideas may be nothing more
than what good teachers have been using or working toward for centuries,
but it cannot be denied that formalizing and labeling them has had and
will continue to have a great impact on education.

The notion that a curriculum, or at least important parts of it,
can successfully be broken down into sets of behavioral objectives
has been advanced by several authors (e.g., Gagné, 1965), and within
the past few years there has been a progression from the theoretical to
the practical, from scholarly articles to the commercial educational
marketplace. Such commercially available programs as the Wisconsin
Design for Reading Skill Development (Otto & Askov, 1974), Developing
Mathematical Processes (Developing Mathematical Processes Staff, 1974),
and Science--A Process Approach (American Association for the
Advancement of Science Commission on Science Education, 1965) are
representative of this move from theory into practice.

But educational reform has not stopped with the development of
curricula based at least in part on behavioral objectives. Along with
the shift toward objectives has come an increased emphasis on flexibility
in instruction, to give each pupil (at least in theory) a better chance
of receiving the kind of instruction that best meets his needs. One
reason for such a system of individualized instruction (Klausmeier,
Quilling, Sorenson, Way, & Glasrud, 1971) is that individuals in a
given group do not all learn a given set of materials at the same rate
or by the same methods, a fact which has been all too painfully
obvious to generations of teachers faced with pupils on one end of the
ability spectrum who exhibited boredom and pupils on the other end
who felt frustrated when the teachers used a pace and form of
presentation appropriate for some pupils in the middle.

A system of behavioral objectives and individualized instruction,
however, offers hope: the objectives allow the teacher to concentrate
on a discrete block of material, and individualization improves the
chances that a given student will spend neither more nor less time on
the material than is needed. This, of course, raises the question,
"How much time is 'enough'?" Although this question is so open-ended
as to have frustrated many theoreticians and researchers, a good bit
has been written on the topic, which has come to be known as the
"mastery learning" issue. While much of the current interest in
mastery learning was given impetus by an article by Bloom (1968), the
underlying philosophy has profited from contributions of many writers
(e.g., Carroll, 1963).

One can easily discuss mastery learning in a theoretical way, but
to make the concept operational in a classroom means defining mastery
for a given behavioral objective, and this in turn necessitates
describing the method by which mastery is to be assessed. This
description does not usually present too great a difficulty; if a
behavioral objective is explicitly stated, it is generally possible to
explicate how mastery can be assessed. Evans (1968) claims, however,
that the behavioral objectives are less important operationally than
the assessment instrument; he maintains that the posttest, not the list
of behavioral objectives, is the ultimate operational measure of what a
teacher is trying to teach. While mastery may sometimes have to be
assessed by somewhat uncommon methods, this report will concern
itself only with the familiar paper-and-pencil test format.

Criterion-Referenced Tests

There are several kinds of instruments whose stated purpose is to
assess mastery. They differ in the number of objectives involved, the
number of items per objective, nomenclature, the meaning of criterion,
and the interpretation given to the test results.

Some tests measure only one objective (DMP Staff, 1974); others
encompass several objectives. Of these, some test each objective with
a single test item (Gessel, 1972) while others require more than one.

There are several names given by various writers to these
assessment instruments: mastery test, objectives-based test,
objective-referenced measure, domain-referenced test, and
criterion-referenced test. This last term, introduced over a decade
ago (Glaser, 1963), has gained perhaps the widest currency. Such
widespread use has also resulted in widespread abuse, since this single
term is employed to cover a range of test types and interpretations.
Recognizing this problem, Donlon (1974) and Millman (1974) have offered
schemata for labeling various kinds of criterion-referenced tests.

In addition, some authors disagree on the meaning of the word
criterion. Some writers (e.g., Nitko, 1971) maintain that criterion
means some observable standard of performance; others (e.g., Harris &
Stewart, 1971) define it as a specified percentage of correct responses
on test items. Some writers indicate that interpretation of the test
results should take into account how many items were responded to
correctly or how far from the criterion the examinee's score lies,
whereas others maintain that the sole matter of interest is whether
mastery was attained. At an even more basic level, there are writers
(Simon, 1969) who argue that there is no such thing as a
criterion-referenced test separate from a more traditional
norm-referenced test; rather, the interpretation one puts on the score
(absolute number rather than relative ranking) is the basis for the
distinction.

Any of these viewpoints may have merit; however, for the purpose
of this report, a criterion-referenced test (CRT) is defined as a test
that measures performance on a single behavioral objective, that has
several items drawn from a well-defined universe, and whose results
yield a dichotomous mastery/nonmastery decision with reference to a
predetermined criterion level expressed as a percentage of items
answered correctly. As such, it comes closest to Roudabush's (1974)
category of a pseudo-continuous measure of a dichotomous true score.
It also seems to fall into Millman's (1974) category of CRDAD, or
criterion-referenced differential assessment device, although this
writer does not agree with all the nuances of implication of the CRDAD
classification. Some of these areas of disagreement will be discussed
in the next chapter.
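
The working definition above reduces each examinee's item responses to a single mastery/nonmastery decision. A minimal sketch of that decision rule, assuming items are scored 1 (correct) or 0 (incorrect) and that a score exactly at the criterion counts as mastery, might look like this. The function and argument names are hypothetical.

```python
def mastery_decision(item_scores, criterion_percent):
    """Return True (mastery) or False (nonmastery) for one examinee.

    item_scores       -- list of 0/1 item scores on a single objective
    criterion_percent -- predetermined criterion level, as a percentage
                         of items answered correctly
    Treating a score equal to the criterion as mastery is an assumption.
    """
    n_correct = sum(item_scores)
    n_items = len(item_scores)
    # Integer comparison avoids floating-point rounding at the cutoff
    return 100 * n_correct >= criterion_percent * n_items


# Ten-item CRT with an 80 percent criterion
master = mastery_decision([1, 1, 1, 1, 1, 1, 1, 1, 0, 0], 80)
nonmaster = mastery_decision([1, 1, 1, 1, 1, 1, 1, 0, 0, 0], 80)
```

Note that the two examinees differ by a single item yet receive opposite classifications, which is why the consistency of the decision, rather than of the score, is the quantity of interest in what follows.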

It will also be shown in the next chapter that a CRT, as defined
above, differs from the more familiar norm-referenced test in several
fundamental aspects: purpose, test specifications, desired score
distributions, method of reporting scores, and meaning of reliability,
among others. Thus the two kinds of tests are quite different and,
although they share some properties, one kind is not, for example, a
special instance or a generalization of the other.

Overview

This report deals with CRTs as previously defined, and its major
focus is on the notion of CRT reliability. Because the purposes,
construction, application, and psychometric theory of CRTs are
considered by many to differ from those of norm-referenced tests
(NRTs), serious questions have been raised in recent years as to
whether classical reliability measures ought to be applied to CRTs.

In Chapter II, several of these questions are raised and
investigated, and an attempt is made to show that classical reliability
indices are not meaningful for at least one important aspect of CRTs.
An extension of the classical mathematical model that incorporates this
aspect of CRTs is suggested and a definition of CRT reliability is
presented. Also suggested is a set of criteria for a CRT reliability
index.

Chapter III is an exposition of coefficient beta, the mean split-half
coefficient of agreement (Marshall & Haertel, 1975), a recently
developed single-administration CRT reliability coefficient. This new
coefficient is based on the theory presented in Chapter II and meets the
criteria suggested therein.



In Chapter IV, a few other CRT indices that have been presented
in the recent literature, including those of Livingston (1972a) and
Harris (1972a), are discussed with emphasis on how well they meet the
criteria suggested in Chapter II. In addition, other indices used
in this study are defined.

Chapter V presents the questions investigated in this study
concerning properties of coefficient beta and its relations to other
test indices. The statistical methodology utilized in answering these
questions is described, as is the computer program used to generate
the simulated data for the study.

Chapter VI presents the results of the investigation and draws
a number of conclusions, and Chapter VII offers a summary and suggests
areas for future research.

The purpose of this report is to provide the educational
measurement community with:

1. a brief summary of the rationale for questioning the
   applicability of classical reliability measures to CRTs.

2. an extension of the classical theory of true and error scores
   to incorporate a theory of dichotomous decisions.

3. a detailed presentation of the mean split-half coefficient
   of agreement, a new single-administration test index
   designed to measure the internal consistency of dichotomous
   classifications.

4. systematic data concerning the properties, under varying
   conditions, of this new coefficient and several other
   single-administration test indices, as well as their
   interrelationships.

In summary, this report offers the rationale, the exposition, the
characteristics, and the relationship to other test indices of a new
coefficient designed to measure the dichotomous decision-making
reliability of CRTs.
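
To make the idea of a mean split-half coefficient of agreement concrete before its formal treatment in Chapter III, the following is a minimal sketch under stated assumptions. It reads the name literally: split the items into two halves in every possible way, classify each examinee as master or nonmaster on each half using the same criterion proportion, record the proportion of examinees classified alike, and average over splits. The formal definition, including the adjustment for an odd number of items, belongs to Chapter III and may differ in detail; the function name and data layout here are hypothetical.

```python
from itertools import combinations


def mean_split_half_agreement(responses, criterion):
    """Average split-half classification agreement over all item splits.

    responses -- list of rows, one per examinee, each a list of 0/1
                 item scores (an even number of items is assumed)
    criterion -- criterion level as a proportion of items correct,
                 applied to each half-test (an assumption of this sketch)
    """
    n_items = len(responses[0])
    if n_items % 2 != 0:
        raise ValueError("this sketch handles an even number of items only")
    half = n_items // 2
    cutoff = criterion * half  # minimum half-test score counted as mastery

    agreements = []
    # Each unordered split appears twice (as A|B and B|A); agreement is
    # symmetric, so the mean is unaffected.
    for half_a in combinations(range(n_items), half):
        agree = 0
        for row in responses:
            score_a = sum(row[i] for i in half_a)
            score_b = sum(row) - score_a
            if (score_a >= cutoff) == (score_b >= cutoff):
                agree += 1
        agreements.append(agree / len(responses))
    return sum(agreements) / len(agreements)


# Three examinees, four items, criterion of 50 percent correct
data = [
    [1, 1, 1, 1],  # consistent master on every split
    [0, 0, 0, 0],  # consistent nonmaster on every split
    [1, 1, 0, 0],  # classification depends on how the items are split
]
beta_sketch = mean_split_half_agreement(data, 0.5)
```

Enumerating every split is feasible only for short tests, since the number of splits grows combinatorially with test length; it is used here because it keeps the logic transparent.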

CHAPTER II
RELATED TEST THEORY

It is appropriate to examine how a criterion-referenced test (CRT)
(as defined in Chapter I) differs from a norm-referenced test (NRT). A
number of authors have discussed aspects of the subject using various
definitions of a CRT (Brennan, 1974; Glaser, 1963; Glaser & Cox, 1968;
Hambleton & Novick, 1973; Millman, 1974; Popham & Husek, 1969). In
this chapter, certain parts of classical test theory will be discussed
briefly and extended to incorporate a proposed theory for CRTs. The
discussion will include the interrelated topics of the purpose of a
test, score distributions, test specifications and item selection, the
underlying mathematical model and errors of measurement, an extension
of the mathematical model, and the meaning of reliability. Since this
chapter is not a treatise on measurement theory as such, the discussion
will not cover all areas in detail but will instead focus on those
points that bear on the arguments developed here.

Purpose of a Test

The fundamental purpose of an NRT is to differentiate among
individuals by assigning to each examinee a number, or estimated true
score, in reference to the norms of the population for which the test
is designed. One's score on an NRT indicates a level of achievement
that is given meaning by comparison with the group at large; it is a
measure of relative standing within the group that can be communicated
via grade equivalents, standard deviations above and below the mean,
stanines, centiles, "grading on a curve," etc. (It is true that NRTs
can be used for dichotomous decisions--a person may be selected for
admission to a training program, for example, according to whether he
scores above a certain cutoff point--but this cutoff score is chosen
in reference to the performance of other candidates and thus is
different from the criterion cutoff score on a CRT.)

Not too many years ago a commissioner of education, in a public
policy address, indicated his hope that within a certain period of
time everyone would be reading "up to or above grade level." When one
considers that grade level is another term for mean, this comment
reduces to a proposal that everyone should be at or above average.
Although the statement is humanistically generous, it is statistically
self-contradictory.

Given a well-defined behavioral objective, however, one could
correctly make a statement about everyone's performing at or above a
certain criterion level. One could measure this performance with a
CRT as defined in Chapter I. The purpose of such a CRT is not to rank
individuals or to report scores in reference to a norm, but rather to
enable one to make a dichotomous decision based on whether a given
pupil is performing on a given behavioral objective at or above a
certain predetermined level (as defined by a certain score or percent
correct on the CRT). Thus the purpose of a CRT is different from that
of an NRT; a CRT provides data from which to make a decision on an
absolute, not a relative, standard (see also Glaser, 1963).
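
The contrast between a relative and an absolute standard can be illustrated with a small numerical sketch. The scores and the cutoff of 16 are invented for illustration only, and the helper names are hypothetical.

```python
from statistics import mean


def share_at_or_above_mean(scores):
    """Proportion of examinees scoring at or above the group mean."""
    group_mean = mean(scores)
    return sum(s >= group_mean for s in scores) / len(scores)


def share_at_or_above_criterion(scores, cutoff):
    """Proportion of examinees scoring at or above a fixed cutoff."""
    return sum(s >= cutoff for s in scores) / len(scores)


# Hypothetical scores on a 20-item test; every pupil scores 16 or more
scores = [16, 17, 18, 19, 20]

relative = share_at_or_above_mean(scores)           # standard moves with the group
absolute = share_at_or_above_criterion(scores, 16)  # standard is fixed in advance
```

Whenever scores vary at all, some examinees must fall below the group mean, so the relative standard can be met by everyone only in the degenerate case of identical scores. The fixed criterion carries no such constraint: all five pupils here meet it.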

Score Distributions

Since, as stated earlier, the purpose of an NRT is to discriminate
among examinees, one would naturally hope for a fairly even score
distribution with a wide spread of scores, so as to allow efficient
discrimination. Thus, in theory, the optimum total score distribution
for an NRT would have some shape within the range of normal (with large
standard deviation), platykurtic, rectangular, or slightly bimodal
(with modes at the extremes). Total score distributions of this sort
usually enhance test reliabilities since they produce moderately high
total score variance. In practice, and consistent with most theories
of traits within a population, a large-variance normal, or a
symmetrical, somewhat platykurtic, distribution often obtains.

However, the assumption of a normal or a platykurtic distribution
for competence on a given behavioral objective is clearly contradictory
to the reason for and purpose of instruction. The reason for
giving instruction toward an objective is that students have not
mastered it; one assumes that before instruction, student proficiencies
are massed for the most part at the lower end of the spectrum.¹ In
teaching, one hopes that all students will master the objective.
With individualized instruction some students may take a good deal
longer than others, but ultimately the purpose of this instruction is
to ensure that the mass of student proficiencies shifts to the upper
end of the scale. In neither case is a normal distribution implied.

¹ Here, and elsewhere in this paper, attention is restricted to a
certain limited type of behavioral objective--one that is quite
specific and narrow in scope, usually from the cognitive domain,
and measurable by a test of several items.



To quote B loos (1968),. 

If W are' effective^ iti <mx instn«^ion, the distribution 
of achieveaMsnt should be very different from th* nor«al 
curve. In fact, we my even insist that owr educational 
efforts have been unsuccessful to the extent to which bur 
distribution of achievement approximates the normal dis- 
trilttition {p. S]* 

With a CRT, moreover, used to make a dichotomous decision with
respect to a predetermined criterion, the desired discrimination is
not among individuals but rather between two mastery groups: those
students who have met the objective and those who have not (Glaser &
Cox, 1968). Hence the desired score distribution is one that is rather
sharply bimodal, with one mode well below and the other mode rather
above the cutoff point (Roudabush, 1974). Research by Blatchford
(1970) shows that these bimodal distributions do indeed occur in
classroom testing. He comments, "In a diagnostic test, as [an example
of] a criterion-referenced test, there is no evidence of a normal
distribution [p. 43]."

For a given administration of a CRT, particularly before or
immediately after instruction, it is even plausible (and quite
acceptable) for the set of scores for one of these mastery groups to be
empty or very nearly so, producing small variance and hence distorted
estimates of reliability by traditional means (Stanley, 1971). Much
has been made of this point in the literature (e.g., Popham & Husek,
1969). Thus the need arises for a new definition of CRT reliability,
so that a test's reliability estimate is not adversely affected by a
score distribution with small variance. It will also become evident,
in the next few sections, that there are additional difficulties in
applying a traditional reliability estimate to a CRT.




Test Specifications and Item Selection

Although a detailed discussion of the mechanics of test
specifications and item selection is not within the scope of this
report, certain facets of this practical topic should be mentioned.

In the construction of traditional tests, the domain of the test
is often defined in relatively loose terms such as biology,
reading comprehension, or mathematical aptitude. First a test
blueprint is prepared indicating in broad outline the emphasis and
topics to be covered. Then items are selected; if they fit the test
blueprint and if they fall within the purview of the subject matter,
they are fair game for inclusion in the initial version of the test.
Whether they are included in the final version depends on performance
on them in the test tryout (and whether the items as a whole
still fit the blueprint). Decisions regarding an item's suitability
for the final version are usually made in terms of its difficulty and
either the item-test correlation (Davis, 1952) or an analogous
statistic (e.g., Baker, 1965).

A CRT, on the other hand, has a decidedly narrower focus,
delineated by the behavioral objective, and thus the items admissible
for inclusion in a test tryout must meet far more restricted
specifications. Some writers have claimed that traditional methods of
item selection are therefore inappropriate and have offered
alternative methods based on from one to three test administrations
(Brennan, I97'i; Brennan & Stolurow, 1971; Cox & Vargas, 1966;
Darlington & Bishop, 1966; Ivens, 1970; Kosecoff & Klein, 1974;
Millman, 1974; Popham, 1971; Popham & Husek, 1969; Wedman, 197>).




Definitive answers to the question of how best to choose items for
a CRT are still being sought, but the work of these authors implies
that traditional methods probably are not the solution.

The Mathematical Model and Errors of Measurement

In classical test theory, the usual mathematical model defines
an examinee's observed score on a test as comprising two components:
true score and error. This model is usually expressed by an equation
equivalent to $X_p = T_p + E_p$, where p is the subscript for persons.
Here $E_p$ is the error of measurement, the amount by which a person's
obtained score ($X_p$) differs from his true score ($T_p$), which in
turn is the score that would have been obtained with a (purely
theoretical) perfect measuring instrument or would have been derived
from an infinite number of administrations of the test or parallel
versions of it. This kind of model has been thoroughly discussed in
the literature (e.g., Lord & Novick, 1968) and will not be detailed
here.

It is important to note that several assumptions are associated
with this mathematical model and hence with its derived results. Three
of those assumptions (Lord & Novick, 1968, p. 56) are basic to the
definition of classical reliability and are mentioned here, since they
are scrutinized later. These assumptions are that (1) true score and
error have zero covariance, (2) the expected value of error over
persons is zero, and (3) errors on parallel measurements have zero
covariance.

In classical theory, the basic question asked is, What is the
examinee's true score? True score is considered a continuous variable
and is expressed on a scale that is usually considered to be interval,
if not ratio. Observed score, expressed on the same scale, is usually
a polytomous rather than a continuous variable, but only because of
the nature and limits of the measuring instrument, which ordinarily
produces scores with integral values. Hence error, like true score,
is continuous; in absolute value, it is expressed on a scale that is
ratio.

This continuous true-score model serves nicely when the purpose
of the test is to determine as precisely as possible what one's true
score is and to report that estimated true score on a polytomous scale.
But that is not the fundamental purpose of a CRT as defined in this
paper. Rather, the basic question asked by a CRT is, "Is the examinee's
true score great enough to allow him to be placed in the 'mastery'
classification?" Although the continuous true score is used as a
first step in answering this question, the final answer, or decision,
is dichotomous and is reported on a scale that is ordinal but not
interval.

These facts suggest an alternative model: one of dichotomous
true and observed scores, with score in this sense meaning decision.
This model has been labeled Platonic (Sutcliffe, 1965) and is well
summarized by Lord and Novick (1968, pp. 39-44). Although the
equation $X = T + E$ is unchanged, elsewhere this model differs markedly
from the classical model presented above. First, true score and
observed score, being dichotomous variables, are expressed not on an
interval scale but on an ordinal scale, as is error, which is a
trichotomous variable (or dichotomous in absolute value). Second, Klein
and Cleary (1967) have shown, among other things, that with the
Platonic true-score model, the covariance of true and error scores is
generally negative and is zero only under extraordinary circumstances.
They also have shown that the expected value of Platonic error scores
is not likely to be zero, and that errors on parallel tests cannot be
expected to have zero covariance. All three of these findings violate
the assumptions upon which the derivation of classical test reliability
rests. (Since covariance is a statistic designed for interval data,
one could question why it has been computed for a model whose data are
measured on an ordinal scale. One could similarly question the
computation of a variance or a correlation and thus the applicability
of a classical reliability coefficient for dichotomous data.)

The meaning of measurement error is also different for the two
models. In classical theory, it is the examinee's true score and
hence the size of the error present in the obtained score that are the
psychometrician's subjects of interest. The obtained score, if there
is error, can vary from the true score by a lot or by a little, and
it makes a difference which of these cases holds. In the Platonic
model, however, there is only one kind of measurement error: incorrect
categorization. There is no large or small associated with it; the
psychometrician is concerned with the existence, not the size, of
error. This view has been stated succinctly by Cronbach and Gleser
(1965) as follows: "a test designed to be maximally efficient for a
particular decision will freely allow errors to enter if they are
irrelevant to that decision [p. IS?]." Others (Hambleton & Novick,
1973) have recognized, even without accepting the Platonic model, that
there is only one kind of measurement error for a CRT.






TABLE 1

ERRORS OF MEASUREMENT UNDER TWO TRUE-SCORE MODELS

               Classical Theory         Platonic Theory
Student        X       T       E        X     T     E

   A          15     9.4     5.6        0     0     0
   B          16    20.0    -4.0        1     1     0
   C          15    19.5    -4.5        0     1    -1
   D          16    10.8     5.2        1     0     1
   E          15    16.2    -1.2        0     1    -1
   F          16    15.2      .8        1     0     1



Moreover, classical and Platonic measurement error need not
correspond for a given set of data. Consider the hypothetical data in
Table 1 for a 20-item CRT with a mastery criterion of 80%,
yielding 16 as the cutoff score. Of students A through F, all of whom
have an obtained score of 15 or 16, students A, B, C, and D have the
largest classical measurement error and students C, D, E, and F have the
largest (only) Platonic measurement error. Likewise, the students with
the smallest classical measurement error are not necessarily those with
a Platonic measurement error of 0.
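
The comparison drawn from Table 1 can be checked with a short sketch.
It assumes, consistently with the table, that the Platonic scores are
obtained by applying the cutoff of 16 to the classical observed and
true scores. The names `STUDENTS`, `classical_error`, and
`platonic_error` are illustrative only and do not come from the report.

```python
# Observed (X) and true (T) scores from Table 1; cutoff = 80% of 20 items.
CUTOFF = 16
STUDENTS = {
    "A": (15, 9.4), "B": (16, 20.0), "C": (15, 19.5),
    "D": (16, 10.8), "E": (15, 16.2), "F": (16, 15.2),
}


def classical_error(x, t):
    """Classical error E = X - T, rounded to one decimal like the table."""
    return round(x - t, 1)


def platonic_error(x, t, cutoff=CUTOFF):
    """Platonic error: observed classification minus true classification.

    Assumption: a classification is 1 (mastery) when the score is at or
    above the cutoff and 0 (nonmastery) otherwise.
    """
    return int(x >= cutoff) - int(t >= cutoff)


for name, (x, t) in STUDENTS.items():
    print(name, classical_error(x, t), platonic_error(x, t))
```

Under these assumptions the four largest classical errors (in absolute
value) belong to A, B, C, and D, while the nonzero Platonic errors
belong to C, D, E, and F, as the text states.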

The table shows that, given the distributions of observed and true
scores under the two models, there need not be a high correlation
between classical and Platonic measurement error, particularly when
observed scores are very near the cutoff score. Those data, of course,
have been chosen to illustrate a point, and as observed scores begin to
move away from the cutoff, the correlation between the two kinds of
measurement error will increase. However, if the scores move farther
from the cutoff, the correlation will decrease. In any case, classical
and Platonic measurement errors are different things, and the theory
developed for one kind of error need not apply to the other.

This fact raises the question of which theoretical model is
appropriate, or preferable, for CRTs. There are arguments for both
models. Those supporting the classical model argue that even if a CRT
is designed to make a dichotomous decision, its initial results
(observed score) are reported on a polytomous scale. It is also felt
that a dichotomous decision "often hides the true level of student
performance [Klein & Kosecoff, 1973, p. 9]." Supporters of this model
believe that it is just not realistic to claim that a person's true
score on a behavioral objective is an all-or-nothing entity.

Primary among the arguments supporting the Platonic model is the
belief that when a test is used to make a dichotomous decision ("go on"
or "don't go on" to the next behavioral objective) the size of the
obtained score is immaterial except as it results in a mastery or
nonmastery classification. It is felt that this dichotomous score is
the only one that need be reported; further subdivisions of the
obtained score have no practical value. "Such gradations in reporting
[scores] are only a function of the alternative courses of action
available to the individual after the measurements have been made
[Popham & Husek, 1969, p. 8]."






It appears that one must choose between the model that is
consistent with continuous true and error scores and the model that
incorporates dichotomous decision and error scores. The need to choose
between these two models can be avoided (and, it is claimed here, should
be avoided) by broadening one's view of the meaning of true and observed
scores. The contention here is that classical true-score theory is
appropriate when the basic purpose of a test is to estimate the true
score. But when the test has a different basic purpose, such as to
determine a dichotomous classification, then the examinee has not one
true score but two, existing simultaneously: a true score that is
involved with the primal measuring process and another that has to do
with the decision or basic question to be answered concerning the
individual and thus with the practical results of that measurement. It
can even be said that there are as many different sets of "true scores"
as there are alternate score-reporting schemes.

The assertion in this report is that a CRT as defined here
involves two different facets of true score: positional and operational.
The first facet deals with the position of one's test score in relation
to the test scores of others; the second facet deals with the
operational effects of the test score on the examinee alone. Classical
NRT theory concerns itself only with the former and for good reason.
When the end result of the testing process is to associate the examinee
with a number (when the test's basic question is what his true score
is), then the positional and operational facets are indistinguishable.
But when the end result of the testing process is to make a dichotomous
decision (when the basic question is whether the examinee merits a
certain classification) and the outcome of that decision has an
immediate and differentiating effect on the student's next educational
activity, then the difference between those positional and operational
facets emerges.

The dual true-score model for CRTs is summarized in Table 2.

TABLE 2

SCHEMA FOR DUAL TRUE-SCORE MODEL

Facet         Basic Question to be Answered      Equation      Scale of
                                                               Answer

Positional    What is the true score?            X = T + E     Continuous

Operational   Is the true score high enough      D = C + M*    Dichotomous
              to merit "mastery"
              classification?

*D = observed classification (decision), C = true classification,
 M = misclassification (error)



Meaning of Reliability 
Classical reliability can be defined as the squared correlation
between observed and true scores (Lord & Novick, 1968, p. 61). This
statistic is equal to the ratio of true-score variance to
observed-score variance if the conditions noted at the beginning of the
previous section are assumed. The classical true-score model, presented
here as the positional facet of the dual true-score model, is
consistent with those assumptions and therefore these definitions of
reliability. However, the Platonic model, or operational facet of a
CRT, is not consistent with those assumptions (Klein & Cleary, 1967),
and hence the classical notion of reliability cannot apply whenever the
reliability of a test has to do with the consistency of decision making,
i.e., whenever the basic measurement question is to be answered
dichotomously.

What then should be the meaning of the operational reliability of
a CRT? For the positional facet, a test is reliable insofar as an
examinee receives the same relative ranking on two sets of data (and
in the case of parallel tests, the same score); for the operational
facet, a CRT should be reliable insofar as an examinee receives the same
classification on both sets of data. Put differently, positional
reliability is concerned with the accuracy of assigning (polytomous)
numbers to examinees; operational reliability (henceforth called CRT
reliability) must necessarily be concerned with the accuracy of
placement in one of two categories.

Consider the theoretical fourfold contingency table given in
Table 3. Classical reliability is defined in terms of a mathematical
relationship between true and observed scores. It would be natural
to begin to investigate CRT reliability in the same terms. With
reference to Table 3, one approach would be to consider the squared
correlation between true and observed classifications, $\rho^2(C, D)$.
Since the variables are dichotomous, this would imply the use of the
squared phi coefficient if the dichotomy is a true one. However, the
dual true-score model presented previously and the arbitrariness of the
mastery cutoff score of a CRT suggest that true classification is an
artificial rather than a real dichotomy, and hence that the phi
coefficient is not the appropriate statistic. (Nonetheless, the phi
coefficient is calculated from a different fourfold table in the
investigation presented in Chapters V and VI.)

TABLE 3

A FOURFOLD TABLE FOR TRUE AND OBSERVED CLASSIFICATIONS

                                Observed Classification (D)
                                     +           -

True Classification (C)    +         a           b

                           -         c           d

If the dichotomy is artificial, then the tetrachoric correlation
coefficient is the appropriate statistic and would yield a formula in
a, b, c, and d (see Table 3). The objection to the cosine-pi estimate
of this statistic is that if either a or d is 0, then the correlation is
-1 even though it may be near 1 when a or d is merely close to 0.
(Nonetheless, the cosine-pi estimate of the tetrachoric correlation
coefficient is also calculated from a different fourfold table in the
study presented later.)
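
The objection can be illustrated numerically. The sketch below assumes
the usual cosine-pi approximation, $\cos\bigl(\pi / (1 + \sqrt{ad/bc})\bigr)$,
which the report names but does not write out. The function name
`cosine_pi` and the cell counts are hypothetical.

```python
import math


def cosine_pi(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation.

    Assumed form: cos(pi / (1 + sqrt(a*d / (b*c)))), where a and d are
    the agreeing cells of the fourfold table. Requires b*c > 0.
    """
    if b * c == 0:
        raise ValueError("b and c must both be nonzero for this sketch")
    return math.cos(math.pi / (1 + math.sqrt(a * d / (b * c))))


print(cosine_pi(0, 1, 1, 97))  # a = 0 forces the estimate to -1
print(cosine_pi(1, 1, 1, 97))  # a = 1 already gives a value close to +1
```

Moving a single examinee into cell a changes the estimate from -1 to a
value near +1, which is the discontinuity the text objects to.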

Another approach to the mathematical relationship between C and
D is the variance-ratio approach. As pointed out earlier, one cannot
assume zero covariance between true classification and
misclassification (error). When this assumption is rejected, a
true-classification variance/obtained-classification variance ratio of

$$ r = \frac{(a + b)(c + d)}{(a + c)(b + d)} = \frac{\pi(1 - \pi)}{p(1 - p)} $$

is obtained, where $\pi$ is the true proportion of mastery
classifications and $p$ is the obtained proportion of mastery
classifications. But this statistic is unsatisfactory for at least two
reasons. First, if $a = d$ or $b = c$, then $r = 1$ no matter what
numbers are in the other two cells; second, if $.5 < \pi < p$ or
$p < \pi < .5$, then $r > 1$, which is clearly not acceptable.
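
Both defects of the variance ratio are easy to confirm with a small
calculation. The sketch assumes the cell layout of Table 3 (a and b in
the true-mastery row, a and c in the observed-mastery column); the
function name and the cell counts are made up for illustration.

```python
from fractions import Fraction


def variance_ratio(a, b, c, d):
    """Ratio of true- to obtained-classification variance for Table 3 cells."""
    return Fraction((a + b) * (c + d), (a + c) * (b + d))


print(variance_ratio(10, 3, 7, 10))   # a == d: ratio is 1 regardless of b, c
print(variance_ratio(5, 4, 4, 20))    # b == c: ratio is 1 regardless of a, d
print(variance_ratio(60, 0, 20, 20))  # pi = .6, p = .8: ratio exceeds 1
```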

So it appears that for the true (C) and observed (D)
classifications in Table 3, neither the correlation approach nor the
ratio of variances approach yields a satisfactory coefficient. Thus
there must be some other mathematical relationship between C and D that
affords a meaningful CRT reliability index. One such relationship,
which follows directly from the notion of CRT reliability as
consistency of classification, is the proportion of classifications
that are correct classifications, $(a + d)/(a + b + c + d)$. Since a
and d are unknown, it would seem that a meaningful CRT reliability
coefficient would be a statistic that estimates, or perhaps is a lower
bound for, this quantity. Furthermore, any such CRT reliability
coefficient should have, so far as possible, the following
characteristics:

1. It should be associated with the notion of consistency or
   accuracy of (dichotomous) classification; hence the more the
   scores depart from the cutoff point, the higher the CRT
   reliability index should be, since such a departure most clearly
   represents a separation between the mastery and nonmastery
   categories.

2. It should be, at least in some respects, variance-free, so
   that it will not vanish when total score variance approaches 0.

3. It should avoid any reliance on classical measurement error
   concepts, since they are not necessarily relevant to a test
   whose purpose is to make a dichotomous decision.

4. It should be a function of the criterion level, since the
   criterion level is an integral part of the CRT as defined in
   this report.

5. It should if possible have a familiar range of values, most
   probably [0,1], for ease of interpretation.

A coefficient that incorporates these features will be presented
in the next chapter.










CHAPTER III 

COEFFICIENT BETA: THE MEAN SPLIT-HALF COEFFICIENT OF AGREEMENT

History and Rationale
Some decades ago, the single-administration reliability, or
internal consistency, of a test was estimated by calculating the Pearson
product-moment correlation between two halves of a test, adjusted by the
Spearman-Brown prophecy formula. Later, other split-half formulas were
introduced (Flanagan, 1937; Rulon, 1939). But there were objections to
the split-half method, since the particular test split chosen (usually
odd-numbered items versus even-numbered items) was not necessarily
representative, and a misleading reliability estimate could result.
Other methods were proposed and proved useful (Hoyt, 1941; Kuder &
Richardson, 1937). Then Cronbach (1951) showed that his coefficient alpha
was not only a generalization of the Kuder-Richardson formula 20 and
equal to Hoyt's internal consistency measure, but was also equal to the
mean of all possible split-half reliability coefficients (but not equal
to the mean of all possible stepped-up split-half correlation
coefficients; see Novick & Lewis, 1967). Thus was established the basis
for estimating internal consistency for a test designed to rank-order
the examinees.

However, when the purpose of a test is to dichotomize rather than
rank-order, the procedure to follow is not so clear-cut (Popham &
Husek, 1969). Several authors (Bergar, 1970; Carver, 1970; Goodman &
Kruskal, 1954; Hambleton & Novick, 1973; Millman, 1974) have suggested
using a simple coefficient for such test reliability, but only in the
dual-administration sense. This index, given various names and symbolic
labels by various authors, will here be called the coefficient of
agreement and, for the sake of simplicity, labeled P. According to
Goodman and Kruskal (1959), this measure of association was reported as
early as 1884, although it was not used for test reliability. The
suggested index is simply the proportion of individuals who are
classified the same way (mastery/mastery or nonmastery/nonmastery) by
two sets of data: test-retest or parallel forms. The coefficient has
not been adapted to the split-half single-administration case, perhaps
for the same reasons as those cited previously for the classical
split-half coefficients.
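
In the dual-administration case the coefficient of agreement is a
one-line computation. The following sketch assumes that a score at or
above the cutoff counts as mastery; the function name and the scores
are hypothetical.

```python
def coefficient_of_agreement(scores_1, scores_2, cutoff):
    """Proportion of examinees classified the same way on two administrations.

    Assumption: a score at or above `cutoff` is classified as mastery.
    """
    same = sum((x1 >= cutoff) == (x2 >= cutoff)
               for x1, x2 in zip(scores_1, scores_2))
    return same / len(scores_1)


# Four examinees, two administrations of a 20-item test, cutoff 16.
print(coefficient_of_agreement([15, 16, 18, 12], [17, 16, 14, 10], 16))
```

Here the second and fourth examinees are classified consistently, so
the printed value is 0.5.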

However, Cronbach's (1951) finding suggests a lead: one can
consider an index that would be equal to the mean of all possible
split-half coefficients of agreement. To extend the analogue with
Cronbach's coefficient alpha, this index will be labeled coefficient
beta ($\beta$).

Definitions 

Let

   $N$ = the number of people taking the test

   $n$ = the number of items in the test

   $X_p$ = the pth person's total score, $p = 1, \ldots, N$

   $c$ = the criterion level, expressed as a fraction ($0 < c \le 1$)

   $k$ = the smallest integer $\ge cn/2$, and hence the minimum number
         of items in a half-test that must be answered correctly
         to receive a "mastery" classification on the half-test

   $X_{1p}, X_{2p}$ = the pth person's scores for the two half-tests,
         $X_{1p} + X_{2p} = X_p$

For now, only tests with an even number of items are considered.
Tests with an odd number of items are dealt with later in the chapter.

There are $v = \binom{n}{n/2}$ possible test splits for an n-item
test if one considers each half to be labeled (i.e., for a two-item
test the split 1/2 is different from the split 2/1). For each pair of
split-halves, construct a fourfold mastery (+) / nonmastery (-)
contingency table

               +       -

        +      A       B

        -      C       D

                            N

and define

$$ P = \frac{A + D}{N}. $$

Then $\beta$ is the mean of P taken over all v possible split-halves (s):

$$ \beta = \frac{1}{v} \sum_{s} P_s = \frac{1}{vN} \sum_{s} (A_s + D_s). $$

But $A_s + D_s$ is the number of consistent classifications (among the
N persons) on test split s, and hence can be written

$$ A_s + D_s = \sum_{p=1}^{N} \delta_{ps}, $$

where $\delta_{ps}$ = 1 or 0 as the pth person's classifications are
consistent or inconsistent, respectively, on test split s. Thus $\beta$
can be written

$$ \beta = \frac{1}{vN} \sum_{s} \sum_{p} \delta_{ps}. $$

Thus $\beta$ is also the mean (over persons) proportion (over test
splits) of consistent classifications.
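
For a very short test the definition can be applied literally by
enumerating every labeled split. The sketch below does so for a 0/1
response matrix; it is meant only as a check on the definition, and the
function name, the response data, and the use of `Fraction` for the
criterion level are assumptions of this illustration.

```python
from fractions import Fraction
from itertools import combinations
from math import ceil


def beta_brute_force(responses, c):
    """Mean split-half coefficient of agreement by direct enumeration.

    responses: list of 0/1 lists, one per examinee (even number of items).
    c: criterion level as a Fraction, 0 < c <= 1.
    """
    n = len(responses[0])
    k = ceil(Fraction(c) * n / 2)          # half-test mastery cutoff
    # Each choice of n/2 items for the first half is one labeled split.
    splits = list(combinations(range(n), n // 2))
    agree = 0
    for first_half in splits:
        for row in responses:
            x1 = sum(row[i] for i in first_half)
            x2 = sum(row) - x1
            agree += (x1 >= k) == (x2 >= k)  # consistent classification
    return Fraction(agree, len(splits) * len(responses))


responses = [[1, 1, 0, 0], [1, 1, 1, 1], [1, 1, 1, 0], [0, 0, 0, 0]]
print(beta_brute_force(responses, Fraction(3, 4)))
```

The number of splits grows very quickly with n, so this approach is
practical only for checking small cases.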



Analysis of the Coefficient

For any given test, the set of possible scores for an individual
is {0, 1, ..., n}. For computational purposes this is partitioned into
five subsets, one or more of which may be empty for a particular n
and k:

   $S_1 = \{0, \ldots, k-1\}$
   $S_2 = \{k, \ldots, 2k-2\}$
   $S_3 = \{2k-1\}$
   $S_4 = \{2k, \ldots, n/2 + k-1\}$
   $S_5 = \{n/2 + k, \ldots, n\}$

(Note that $k = 1$ implies $S_2 = \{\,\}$ and $k = n/2$ implies
$S_4 = \{\,\}$.) Then consider scores in each of the five subsets:



1. For $X_p \in S_1$, $X_p < k$. Thus mastery on a half-test cannot be
obtained no matter how the test is split, since both $X_{1p}$ and
$X_{2p}$ must necessarily be less than k. Hence all persons with
$X_p \in S_1$ will contribute to D, as defined in the contingency table
above, for all v test splits.

2. For $X_p \in S_2$, $k \le X_p \le 2k-2$. Here some splits will
contribute to B or C (for example, $X_p = k+1$; $X_{1p} = k$,
$X_{2p} = 1$) and some will contribute to D (for example,
$X_p = 2k-2$; $X_{1p} = X_{2p} = k-1$). The obvious question "Which
splits?" becomes a problem of combinatorics. Since only A and D enter
into Equation 1, one need not be concerned with contributions to B and
C. (These contributions will be equally divided among B and C because
of the symmetry implied in "labeling" the halves of the test.)

The question then reduces to "For a score of $X_p \in S_2$, how many
D-categorizations will result?" A D-categorization will happen when
neither half-test is mastered and thus both $X_{1p}, X_{2p} \le k-1$.

Define $\mathbf{X}_{1p}$ and $\mathbf{X}_{2p}$ as vectors of 0's and
1's, indicating incorrect and correct responses, respectively, to items
on each half-test. If one vector has $k-1$ 1's, the other has
$X_p - (k-1)$ 1's. Moreover, since $X_p \in S_2$ and hence
$X_p \le 2k-2$, it follows that $X_p - (k-1) \le k-1$. Thus one is
interested only in those pairs of vectors in which the number of 1's in
each is between these two limits, namely $X_p - (k-1) \le$ both
$X_{1p}, X_{2p} \le k-1$. Moreover, since in the total score there are
$X_p$ 1's, there are $n - X_p$ 0's. In the half-score, if there are j
1's, there are $n/2 - j$ 0's. Thus, for $X_p \in S_2$, we can pick
pairs of vectors that will yield D-categorizations in

$$ \sum_{j = X_p - (k-1)}^{k-1} \binom{X_p}{j} \binom{n - X_p}{n/2 - j} $$

ways.







3. For $X_p \in S_3$, $X_p = 2k-1$. Thus the most "balanced" split will
yield k 1's in one vector and $k-1$ 1's in the other, indicating
mastery in the first case and nonmastery in the second. Other, less
"balanced" splits will yield more extreme allocations of 1's, resulting
in the same mastery/nonmastery classifications. Thus, for all
$X_p \in S_3$, no split contributes to A or D.

4. For $X_p \in S_4$, $2k \le X_p \le n/2 + k-1$. This case is similar
to that of $S_2$. Some splits will contribute to B or C (for example,
$X_p = 2k$; $X_{1p} = k+1$, $X_{2p} = k-1$) and some will contribute to
A (for example, $X_p = 2k$; $X_{1p} = X_{2p} = k$). Since
$X_p \ge 2k$, it cannot be that both $X_{1p}, X_{2p} < k$, and hence
there are no contributions to D. Again we ignore the contributions to B
and C, and focus attention instead on the contributions to A.

In this case, one needs to count those vectors where both half-tests
are mastered, i.e., where both $X_{1p}, X_{2p} \ge k$. If one half-test
vector contains k 1's, the other contains $X_p - k$ 1's. But
$X_p \in S_4$ implies $X_p \ge 2k$, which implies $k \le X_p - k$. Thus
one is interested only in those half-test vectors such that $k \le$
both $X_{1p}, X_{2p} \le X_p - k$. By using reasoning identical to that
for $S_2$, the total number of splits that will contribute to A for
$X_p \in S_4$ is

$$ \sum_{j = k}^{X_p - k} \binom{X_p}{j} \binom{n - X_p}{n/2 - j}. $$



5. For $X_p \in S_5$, $X_p \ge n/2 + k$. This says that half the items
plus at least another k items are answered correctly, and thus both
$X_{1p}, X_{2p} \ge k$ no matter how the test is split. Hence all v
splits contribute to A.




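The coefficient

The above analysis yields an equation for $\beta$, the mean split-half
coefficient of agreement. For X in each of the five subsets, define
the following functions $\phi_i(X)$, $i = 1, \ldots, 5$:

1. $0 \le X \le k-1$:            $\phi_1(X) = 1$

2. $k \le X \le 2k-2$:           $\phi_2(X) = \frac{1}{v} \sum_{j = X-k+1}^{k-1} \binom{X}{j} \binom{n-X}{n/2-j}$

3. $X = 2k-1$:                   $\phi_3(X) = 0$

4. $2k \le X \le n/2 + k-1$:     $\phi_4(X) = \frac{1}{v} \sum_{j = k}^{X-k} \binom{X}{j} \binom{n-X}{n/2-j}$

5. $n/2 + k \le X \le n$:        $\phi_5(X) = 1$

Here $\phi_i(X)$ is the proportion of splits that contribute to A or D
for a given score X.

Then Equation 1 can be rewritten

$$ \beta = \frac{1}{N} \sum_{p=1}^{N} \phi_i(X_p) \qquad (2) $$

where the index i depends on the value of $X_p$. Hence $\beta$ has range
[0,1]; it is 0 when all $X_p \in S_3$, and 1 when all
$X_p \in S_1 \cup S_5$.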
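
The five-case definition translates directly into code. The sketch
below follows the subsets $S_1$ through $S_5$ case by case and then
averages over examinees as in Equation 2. It assumes an even number of
items and a criterion level supplied as a `Fraction`, which avoids
floating-point error when computing k; the function names are
illustrative.

```python
from fractions import Fraction
from math import ceil, comb


def phi(x, n, k):
    """Proportion of labeled splits that classify score x consistently."""
    half = n // 2
    v = comb(n, half)                      # number of labeled splits
    if x <= k - 1 or x >= half + k:        # S1 or S5: every split agrees
        return Fraction(1)
    if x == 2 * k - 1:                     # S3: no split agrees
        return Fraction(0)
    if x <= 2 * k - 2:                     # S2: both halves below k
        js = range(x - k + 1, k)
    else:                                  # S4: both halves at or above k
        js = range(k, x - k + 1)
    ways = sum(comb(x, j) * comb(n - x, half - j) for j in js)
    return Fraction(ways, v)


def beta(scores, n, c):
    """Equation 2: the mean of phi over examinees (n even, c a Fraction)."""
    k = ceil(Fraction(c) * n / 2)
    return sum(phi(x, n, k) for x in scores) / len(scores)


print(beta([2, 4, 3, 0], n=4, c=Fraction(3, 4)))   # four-item example
print(float(phi(16, 20, 8)))                       # 20 items, 80% criterion
```

For the four-item example, k = 2 and the four scores contribute 2/3, 1,
0, and 1, so the mean is 2/3.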

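Although Equation 2 sums up the analysis rather simply, it is
inefficient for computing purposes. A more efficient method involves
generating a frequency distribution of total scores and computing
$\phi_i(X)$ only once for each possible value. In general, let $f_X$
be the frequency of score X, $X = 0, \ldots, n$, with
$\sum_{X=0}^{n} f_X = N$. Then

$$ \beta = \frac{1}{N} \sum_{X=0}^{n} f_X \, \phi_i(X), $$

where again the index i depends on the value of X.

More explicitly, since for some values of X, $\phi_i(X)$ = 0 or 1,

$$ \beta = \frac{1}{N} \left[ \sum_{X=0}^{k-1} f_X + \sum_{X=k}^{2k-2} f_X \, \phi_2(X) + \sum_{X=2k}^{n/2+k-1} f_X \, \phi_4(X) + \sum_{X=n/2+k}^{n} f_X \right]. $$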
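
The frequency-distribution form can be sketched as follows. Here
$\phi$ is written as a direct count of the splits on which the two
half-test classifications agree, which should give the same values as
the five-case definition; that equivalence, the function names, and the
sample scores are assumptions of this illustration.

```python
from collections import Counter
from fractions import Fraction
from math import ceil, comb


def phi(x, n, k):
    """Share of labeled splits on which both halves classify score x alike."""
    half = n // 2
    ways = sum(
        comb(x, j) * comb(n - x, half - j)
        for j in range(max(0, x - half), min(x, half) + 1)
        if (j >= k) == (x - j >= k)        # j correct in half 1, x - j in half 2
    )
    return Fraction(ways, comb(n, half))


def beta_from_frequencies(scores, n, c):
    """Beta from the frequency distribution of total scores (n even)."""
    k = ceil(Fraction(c) * n / 2)
    freq = Counter(scores)                        # f_X
    phi_table = {x: phi(x, n, k) for x in freq}   # one evaluation per score
    return sum(f * phi_table[x] for x, f in freq.items()) / len(scores)


print(beta_from_frequencies([2, 2, 4, 3, 0, 0], n=4, c=Fraction(3, 4)))
```

With many examinees and few distinct scores, the saving comes from
evaluating the binomial sums once per score value rather than once per
person.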




Adjustment for odd n 

For an odd number of items, a test split is defined as resulting
when one item is deleted and the remaining items are divided into two
sets, each containing $(n-1)/2$ items. In this case, k is the smallest
integer greater than or equal to $c(n-1)/2$. The item to be deleted
may be chosen in n ways, each yielding a distinct set of $n-1$ items
to be split. Hence there are $n \binom{n-1}{(n-1)/2}$ possible test
splits if one again considers each half to be labeled.

For person p, with total score $X_p$, the response vector
$\mathbf{X}_p$ contains $X_p$ 1's and $(n - X_p)$ 0's. Thus, for
person p, $X_p$ of the n possible choices of the item to be deleted
will result in a set of $n-1$ items containing $(X_p - 1)$ 1's, and
$n - X_p$ choices will result in a set containing $X_p$ 1's. Hence the
contribution to $\beta$ for person p, rather than $\phi_i(X_p)$, will be

$$ \frac{1}{n} \left[ X_p \, \phi_i(X_p - 1) + (n - X_p) \, \phi_i(X_p) \right]; $$

hence, taking the mean over persons,

$$ \beta = \frac{1}{nN} \sum_{p=1}^{N} \left[ X_p \, \phi_i(X_p - 1) + (n - X_p) \, \phi_i(X_p) \right]. $$

As before, it is necessary to compute $\phi_i(X)$ only once for each
possible value of X.

Also as before, the computation is more efficient if we utilize
the frequency distribution of total scores. Recall that for a score
of $X_p$ on n (odd) items, for $n - X_p$ choices of the item to be
deleted the total score on $n-1$ items will remain at $X_p$, and for
$X_p$ choices the total score on $n-1$ items will be reduced to
$X_p - 1$. The effect is that of a transformation on the set of total
scores. In symbols,

   $X \to X$ in $\frac{n-X}{n}$ of the cases;

   $X \to X-1$ in $\frac{X}{n}$ of the cases.

Thus a total score of X is arrived at with frequency

$$ g_X = \frac{n-X}{n} f_X + \frac{X+1}{n} f_{X+1}, $$

and $g_n = \frac{n-n}{n} f_n + \frac{n+1}{n} f_{n+1} = 0$ (since no
score exceeds n, $f_{n+1} = 0$), and therefore
$\sum_{X=0}^{n} g_X = \sum_{X=0}^{n-1} g_X$. Furthermore, it is easily
shown (see Appendix A) that $\sum_{X=0}^{n-1} g_X = \sum_{X=0}^{n} f_X$.
Thus taking the mean over the transformed frequency distribution of
total scores, coefficient beta is

$$ \beta = \frac{1}{N} \sum_{X=0}^{n-1} g_X \, \phi_i(X), $$

where once again the index i depends on the value of X. Thus, in
practice, the computation of $\beta$ is identical for the cases of even
and odd n, except that in the latter case one first performs an
additional step, replacing $f_X$ by
$\frac{n-X}{n} f_X + \frac{X+1}{n} f_{X+1}$ for $X = 0, 1, \ldots, n-1$
and then using $n-1$ in place of n in the computations of k and
$\phi_i(X)$.
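
The additional step for odd n can be sketched as a transformation of
the frequency distribution followed by the even-n computation on
$n - 1$ items. As in the previous sketches, the direct-count form of
$\phi$, the function names, and the sample scores are assumptions made
for illustration.

```python
from collections import Counter
from fractions import Fraction
from math import ceil, comb


def phi(x, n, k):
    """Share of labeled splits of n (even) items classifying x consistently."""
    half = n // 2
    ways = sum(
        comb(x, j) * comb(n - x, half - j)
        for j in range(max(0, x - half), min(x, half) + 1)
        if (j >= k) == (x - j >= k)
    )
    return Fraction(ways, comb(n, half))


def beta_odd(scores, n, c):
    """Coefficient beta for an odd number of items n (c a Fraction)."""
    m = n - 1                               # items left after one deletion
    k = ceil(Fraction(c) * m / 2)
    f = Counter(scores)                     # f_X; missing scores count as 0
    total = Fraction(0)
    for x in range(n):                      # transformed scores 0 .. n-1
        g = Fraction((n - x) * f[x] + (x + 1) * f[x + 1], n)   # g_X
        total += g * phi(x, m, k)
    return total / len(scores)


print(beta_odd([5, 3, 0], n=5, c=Fraction(3, 4)))
```

For this five-item example the transformed frequencies are 1, 0, 3/5,
2/5, and 1, which sum to N = 3 as required, and the result is 4/5.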







Technical Characteristics of Coefficient Beta

Although coefficient beta is defined solely on the basis of
fourfold contingency tables, its computational formula (Equation 3)
is a function of the score distribution as well as of the number of
items and the criterion level. Since these latter two parameters are
(or should be) known before a test is administered, the value of
$\beta$ for a particular tryout results from the frequency distribution
of total scores. The same is true of values of Harris's $\mu_c^2$ and
the criterion-referenced index of separation ($S_c$), which are
discussed in the next chapter. Like $S_c$ but unlike $\mu_c^2$,
$\beta$ is the mean of its additive parts. That is, given $\beta'$ for
a set of scores of $N-1$ examinees, if the score of an Nth examinee
were to be added to the set, a new $\beta$ could be calculated from

$$ \beta = \frac{1}{N} \left[ (N-1) \beta' + \phi_i(X_N) \right], $$

since from Equation 2,

$$ (N-1) \beta' = \sum_{p=1}^{N-1} \phi_i(X_p). $$

A similar argument holds for the addition of a set of scores.
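
The additivity property amounts to a running-mean update. The sketch
below takes the previous coefficient, the number of examinees it was
based on, and the new examinee's contribution $\phi_i(X_N)$; the
function name and the numbers are illustrative.

```python
from fractions import Fraction


def update_beta(beta_prev, n_prev, phi_new):
    """Return beta for n_prev + 1 examinees.

    beta_prev: coefficient for the first n_prev examinees (beta').
    phi_new:   contribution phi_i(X_N) of the added examinee.
    """
    return (n_prev * beta_prev + phi_new) / (n_prev + 1)


# Three examinees with contributions 2/3, 1, and 0 give beta' = 5/9;
# adding a fourth examinee whose contribution is 1 gives 2/3.
print(update_beta(Fraction(5, 9), 3, Fraction(1)))
```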

Since this additivity is a property of coefficient beta, one can
investigate the relative contribution of the pth person's score to the
value of the coefficient, given the number of items and the criterion
level, merely by determining $\phi_i(X_p)$. For illustration, Figure 1
shows these relative contributions for a 20-item test with criterion
levels of 70% and 80%. Additional graphs, covering a range of numbers
of items and criterion levels, can be found in Appendix B.



Figure 1. $\phi_i(X)$ for a 20-item test; two criterion levels.



It is apparent from Figure 1 (as well as from the analysis of the coefficient presented earlier in this chapter) that as scores approach the integer immediately below the cutoff, they contribute successively less to the value of $\beta$; at the score 2k-1 (with k as defined earlier), the contribution is zero. This is to be expected since the score 2k-1 composes the subset $S_3$ as defined earlier and $\phi_3(X) = 0$.

Figure 1 might be misleading in the sense that, in these two examples, the point 2k-1 is one less than cn, the product of the criterion level and the number of items, and hence is one less than the test's cutoff score. One might therefore ask why 2k-1, and not 2k, is the score with a zero contribution to coefficient beta. It should be pointed out, however, that this relation does not always hold. On a 12-item test with criterion level of 75%, for example, the points 2k-1 and cn are both 9. In general, if cn is an odd integer, 2k-1 = cn; if cn is even, 2k-1 = cn-1. If cn is not an integer, 2k-1 can be greater than cn (e.g., if n = 16 and c = 80%, then cn = 12.8 and 2k-1 = 13). In general, depending on the values of c and n, 2k-1 falls somewhere in the half-open interval [cn-1, cn+1).

Even though 2k might at first glance seem to be a more appropriate candidate than 2k-1 for the score with zero contribution to coefficient beta, 2k falls in the interval [cn, cn+2), and therefore, in a mathematical expectation sense, is not as good an approximation to cn as 2k-1.
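The relation between 2k-1 and cn can be checked numerically. The sketch below assumes that k is the smallest integer not less than cn/2 (the half-test cutoff) and uses exact rational arithmetic to avoid rounding problems; the function name is hypothetical.

```python
from fractions import Fraction
from math import ceil


def zero_contribution_score(n_items, criterion_percent):
    """Return (2k - 1, cn) for an even-length test, with k = ceil(cn / 2)."""
    c_n = Fraction(criterion_percent, 100) * n_items
    k = ceil(c_n / 2)
    return 2 * k - 1, c_n


if __name__ == "__main__":
    # The four cases mentioned in the text
    for n_items, pct in [(20, 70), (20, 80), (12, 75), (16, 80)]:
        score, c_n = zero_contribution_score(n_items, pct)
        print(n_items, pct, score, float(c_n))
```

Exact fractions are used here because a criterion such as 80% of 16 items is not an integer, and a floating-point ceiling could be off by one near integer products.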







Discussion

Although attention in this dissertation has been given to criterion-referenced tests, it should be pointed out that coefficient beta is applicable whenever reliability is viewed as consistency of classification or consistency of decision-making based on scores from a measuring instrument, provided that the classification decision is based on some sort of cutoff point expressible as a percent of items responded to in a certain manner.

Second, and consistent with the notion of accuracy of categorization from the results of a limited number of items, it should be noted that coefficient beta increases as the number of items on a test increases, as shown in a later chapter. The degree to which this increase follows the Spearman-Brown prophecy formula is discussed in a later chapter.

Third, one should also note that if examinees respond randomly to the items on a test, the resulting coefficient beta is not zero, as might be expected with a traditional reliability measure. In fact, depending on the values of c, n, and N and on the number of options per item (assuming a multiple-choice test), coefficient beta would probably take on a rather high value, possibly even 1. From the standpoint of traditional test theory, this is disconcerting. Yet it is understandable, from the CRT standpoint, if one recalls that coefficient beta is designed to measure the operational reliability of a CRT: if all examinees respond randomly to a test, it is a clear indication that they are about as far from mastery as is possible. The high value of coefficient beta would indicate that the test is classifying most of them as such, and reliably so. Nonetheless, a test constructor might want additional test tryout information before passing judgment on the instrument's reliability, as in the construction of an NRT.

Fourth, it is appropriate to see how coefficient beta measures up to the criteria for a CRT reliability coefficient that were set forth at the end of the last chapter.

1. Coefficient beta is based on the notion of accurate placement in categories. It turns out that beta does attain its highest values when the test scores depart from the cutoff; however, these scores need not be at the extremes for beta to take on its highest values. For example, on a 20-item test with a criterion level of 70% (yielding a cutoff score of 14), $\beta$ = 1 if all scores are in {0, ..., 6} ∪ {17, ..., 20}. As the total scores pile up near the cutoff, the value of $\beta$ decreases.

2. Coefficient beta is variance-free in the respect deemed most important by critics of a variance-dependent CRT reliability coefficient: it can take on any value from 0 to 1 even though the total score variance is 0, depending on the relative values of the cutoff score and the (single-membered) set of test scores. The coefficient is, however, variance-dependent in other respects. As the variance approaches its maximum, $\beta$ approaches 1. This relation is reassuring since maximum variance on an n-item test occurs only when scores are equally divided between 0 and n, which scores indicate the clearest possible separation of examinees into two classifications. Furthermore, if $\beta$ = 0, then the variance is zero. These relations are easily summarized: if the variance is high, coefficient beta is high; if the variance is low, there is no restriction (within its range) on coefficient beta.

3. Coefficient beta is not based on traditional measurement error concepts. Since it is built around the theory of dichotomous categorizations and Platonic true scores, the Platonic notion of misclassification is the only measurement error involved.

4. Coefficient beta is an algebraic function of the criterion level (and other parameters).

5. Coefficient beta has a range of [0,1], although values near 0 occur only under highly improbable conditions.

Coefficient beta and trichotomous data

The authors of some commercial instructional programs, such as Developing Mathematical Processes (DMP Resource Manual, Topics 1-40, 1974), contend that mastery/nonmastery alone is not a sufficient categorization of test results, and that more valuable information and more appropriate teacher options become available if the test result data are trichotomized into classifications such as "mastery," "progress," and "nonmastery." Coefficient beta, as outlined above, is clearly not sensitive to such a trichotomization scheme.

The trichotomous coefficient of agreement in such a situation would be equal to

$$P = \frac{A + E + I}{N},$$

based on the following table, in which +, *, and - stand for the three categorizations:

             +      *      -

      +      A      B      C

      *      D      E      F

      -      G      H      I

                                  N

A coefficient analogous to $\beta$ and applicable to this situation should be equal to the mean of $(A + E + I)/N$ over all possible split-halves, or the mean split-half trichotomous coefficient of agreement.

Such a coefficient can be derived, although the derivation is not presented here. The analysis of this coefficient, although more complex in places, is essentially parallel to the analysis of coefficient beta presented earlier. Instead of partitioning the set {0, ..., n} into five subsets, one partitions it into seven. Recall that for coefficient beta, k is the minimum number of items on a half-test that must be answered correctly for a mastery classification. If, for trichotomized data, one in addition lets $\ell$ be the minimum number of items on the half-test that must be answered correctly for the middle classification, then the seven subsets of {0, ..., n}, together with their corresponding values of $\phi_i(X)$, i = 1, ..., 7, are



*,CX) - 1 




= {2£-l} ^jCX) = 0 



51 



• IZt 2k-2) 





where 0 < t < k f | , 

Kot« th«t t * i i«pUes $2 ana k * jJ«Piie*^_S^^^ l 

A5 before, the coaputatlon is aade fliorc efficient by utilizing the 
frequency distribution of total scores, and hence a forwula for Bj, the 
»eai> split-half trichoto*ous coefficient of agreement, is 

Since ❖j(X) 0 or I in fcpux of the -teven cases, this can be wore 



explicitly rcvritten 



as 



X«2k * " 



n 






where 




«x>d ^nd ate as above. 

The trichotomous coefficient requires the same adjustments for an odd number of items as does the dichotomous coefficient, except that n-1 is used in calculating $\ell$ as well as k and $\phi_i(X)$.

Note that if the test is multiple choice, the lower of the two criterion levels should not be set near the percent of items that should be answered correctly due to chance, as this would result in unreliable classification decisions between the lower two categories. In this case, if there are a significant number of nonmasters in the population, the value of the trichotomous coefficient would tend to be rather low, as would be expected.



CHAPTER IV

OTHER SINGLE-ADMINISTRATION COEFFICIENTS

Several authors have recently devised or resurrected indices dealing either directly or peripherally with CRT reliability. Some indices are based on one administration of a test (Harris, 1972a; Livingston, 1972a; Marshall, 1973), some on two administrations (Berger, 1970; Carver, 1970; Hambleton & Novick, 1973; Ivens, 1970; Millman, 1974; Ozenne, 1971; Swaminathan, Hambleton, & Algina, 1974), and some on three administrations (Brennan, 1974). This report is concerned solely with single-administration indices.

The two single-administration coefficients that have received the widest attention are $k^2_{tx}$ (Livingston, 1972a) and $\mu_c^2$ (Harris, 1972a). A third measure is the index of separation of test scores (Marshall, 1973). These and three other coefficients are presented in this chapter. Since the relation of each of these indices to coefficient beta is detailed in a subsequent chapter, their rationale is discussed briefly here, as is their degree of adherence to the criteria given at the close of Chapter II.

Livingston's Criterion-Referenced Reliability Coefficient

Livingston's coefficient, $k^2_{tx}$, is widely known and the most discussed coefficient in the recent literature. It stems from an interesting application of classical reliability theory, and departs therefrom only in the notion of mean square deviation. Instead of using variance as the mean square deviation from the mean of scores, Livingston substitutes for it a quantity equal to the mean square deviation from the cutoff point. The assumption is that the deviation of a person's score from the cutoff, not the deviation from the mean, is of interest in a CRT. The rest of Livingston's careful algebraic development parallels that of classical theory, and the resulting $k^2_{tx}$ is related algebraically to classical quantities:

$$k^2_{tx} = \frac{r\,\sigma^2 + (\bar{X} - C)^2}{\sigma^2 + (\bar{X} - C)^2} \qquad [4]$$

where

r = classical internal consistency reliability
$\sigma^2$ = variance of total scores
$\bar{X}$ = mean of total scores
C = criterion cutoff point (not necessarily an integer).

As can be seen from Equation 4 (and as pointed out by Livingston), $k^2_{tx}$ approaches r as $\bar{X}$ approaches C.
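Equation 4 can be evaluated directly from a set of total scores once a classical reliability estimate is supplied. The following is a minimal sketch, assuming the population (not sample) variance is intended; the function name and the score sets in the demonstration are hypothetical.

```python
from statistics import fmean, pvariance


def livingston_k2(scores, cutoff, r):
    """Livingston's coefficient from Equation 4.

    r is a classical internal-consistency estimate supplied by the caller.
    Population variance is an assumption of this sketch.
    """
    mean = fmean(scores)
    var = pvariance(scores)
    dev2 = (mean - cutoff) ** 2
    if var + dev2 == 0:
        raise ValueError("undefined: zero variance with mean at the cutoff")
    return (r * var + dev2) / (var + dev2)


if __name__ == "__main__":
    # Mean at the cutoff: the coefficient reduces to r
    print(livingston_k2([12, 16], cutoff=14, r=0.80))
    # Scores evenly divided between 13 and 14 with a cutoff of 14
    print(livingston_k2([13, 14] * 5, cutoff=14, r=0.80))
```

The second call corresponds to scores evenly divided between 13 and 14, for which Equation 4 gives .5r + .5.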

This coefficient has been the subject of much criticism, comment, and rebuttal (Hambleton & Novick, 1973; Harris, 1972b; Hsu, 1971; Livingston, 1972b, 1972c; Marshall, 1973; Ozenne, 1971; Raju, 1973; Shavelson, Block, & Ravitch, 1972). Summaries of the arguments can be found in the references by Brennan (1974), Rim (1974), and Wedman (1973), and are not presented here. In this section, $k^2_{tx}$ is analyzed with respect to the criteria for a CRT reliability index set forth at the end of Chapter II.






1. It is not the distances of the scores themselves from the criterion cutoff that contribute to high values of $k^2_{tx}$, but rather the distance of the mean of scores from the cutoff, as Equation 4 shows. This fact is of no consequence when the score distribution is unimodal and generally symmetric, since under these conditions the mode and mean will tend to coincide. But when the distribution is bimodal, which is desirable for a CRT, then this fact becomes important in interpreting $k^2_{tx}$. It is particularly important when the mean falls about halfway between the two modes. Consider the earlier example of a 20-item test with a cutoff of 14. Suppose the data from two samples, A and B, form "inverted triangular" distributions with different means, as shown in Figure 2. If the classical test reliability is .80 in both cases, $k^2_{tx}$ = .91 for sample A and .80 for sample B, even though sample B seems to show a clearer separation between nonmasters and masters, since there are fewer scores at or near the cutoff. (Coefficient beta would have values of .72 and .88 for samples A and B, respectively.)




Figure 2. Two hypothetical score distributions.



2. The coefficient $k^2_{tx}$ is not variance-free, as is evident from Equation 4; it is dependent on total-score variance, as are classical coefficients, although in a different way. When total score variance is zero, $k^2_{tx}$ = 1 (see Equation 4), unless $\bar{X}$ = C. Thus the coefficient does not vanish when the variance approaches zero, but instead it tends toward a unique value. When the variance approaches its maximum, $k^2_{tx}$ again approaches 1 because the traditional reliability coefficient also approaches 1 under these conditions and $k^2_{tx} \ge r$. Under less extreme conditions total score variance has varying effects on $k^2_{tx}$.

3. The coefficient $k^2_{tx}$ is based on classical error of measurement. In fact, as Harris (1972b) points out in criticizing the coefficient, the standard error of measurement is the same in Livingston's framework as it is in the classical framework, even though the value of Livingston's reliability coefficient is normally higher than that of a classical coefficient.

4. As Equation 4 shows, $k^2_{tx}$ is an algebraic function of the criterion level (and other parameters).

5. The coefficient has the familiar range [0,1] under most conditions, although it is theoretically possible for it to take on negative values, when the classical internal consistency estimate is negative and the test mean is at or very nearly at the cutoff.

Harris's Index of Efficiency

The index of efficiency, $\mu_c^2$, proposed by Harris (1972a) is intended "to examine how well the test sorts defined samples of students into categories and possibly to measure its efficiency in this sense [p. 4]." It has been interpreted as the squared correlation between test score and a 0/1 dummy variable representing the nonmastery/mastery classifications. Harris also points out that $\mu_c^2$ can be conceived of as the ratio of true-score variance to observed-score variance if true score is defined for the subjects in each of the two groups as the group mean. The computational formula is

$$\mu_c^2 = \frac{SS_b}{SS_b + SS_w} \qquad [5]$$

where the terms in the ratio represent the between-group and within-group sums of squares for the groups resulting from the dichotomous classification.
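The computational formula can be sketched as follows, assuming that a score at or above the cutoff is classified as mastery; the function name is hypothetical, and the sketch raises an error where the index is undefined (zero total-score variance).

```python
from statistics import fmean


def harris_efficiency(scores, cutoff):
    """Between-group sum of squares over total sum of squares for the
    mastery / nonmastery grouping defined by the cutoff."""
    groups = [
        [x for x in scores if x >= cutoff],  # classified as masters
        [x for x in scores if x < cutoff],   # classified as nonmasters
    ]
    grand_mean = fmean(scores)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups if g)
    ss_within = sum((x - fmean(g)) ** 2 for g in groups if g for x in g)
    total = ss_between + ss_within
    if total == 0:
        raise ValueError("undefined: total-score variance is zero")
    return ss_between / total


if __name__ == "__main__":
    print(harris_efficiency([0, 20, 0, 20], cutoff=14))
    print(harris_efficiency([13, 14, 13, 14], cutoff=14))
    print(harris_efficiency([14, 15, 14, 15], cutoff=14))
```

The three calls correspond to score sets of the kind considered in this section: all scores at 0 or 20, all scores at 13 or 14, and all scores at 14 or 15.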

The index is analyzed as follows with respect to the CRT reliability criteria.

1. The index of efficiency has highest values when the total score distribution is sharply bimodal with a mode on either side of the cutoff, but these modes need not be far from the cutoff. For example, on the 20-item test with C = 14, $\mu_c^2$ = 1 if all scores are either 0 or 20, which is reassuring. But $\mu_c^2$ is also 1 if all scores are 13 or 14; a perfect $\mu_c^2$ occurs even though all mastery/nonmastery classifications could be reversed with a change of only one point in each person's total score. (Coefficient beta would have a very low value under these conditions, less than 0.20 if the scores are more or less evenly divided, and Livingston's $k^2_{tx}$ would be no greater than .5r + .5, where r is a classical reliability coefficient.)

2. The index of efficiency is variance-dependent, but in a somewhat different way than a classical coefficient is. As Equation 5 indicates, $\mu_c^2$ is undefined when total-score variance is zero; and when total-score variance is at its maximum, $\mu_c^2$ = 1. But $\mu_c^2$ can also be high even though the variance is small (but not zero). Given a 20-item test with a cutoff of 14, $\mu_c^2$ = 0 if all examinees score 14 or 15; if one examinee scores 13 and the rest score 14, $\mu_c^2$ = 1.

3. Except for the true-variance/total-variance ratio interpretation mentioned earlier, the index of efficiency is not based on traditional measurement error concepts. (An example of the index's departure from traditional measurement error concepts was given under point 1.)

4. Although not explicitly part of the computing formula, the criterion level is nonetheless implicit in the calculation of $\mu_c^2$ since it is the basis for defining the two groups into which the examinees are sorted and for which the sums of squares are calculated.

5. The index of efficiency has the familiar [0,1] range. It is 0 when all examinees are classified the same way (provided variance is not 0); it is 1 when there are two groups and each within-group variance is 0 (see Equation 5).

The Index of Separation

The index of separation of total scores (S) is designed to measure the degree to which the set of total scores on an n-item test approaches the set {0, n}. It is based on the assumption that the population taking a CRT is in fact the union of two subpopulations, either of which may be empty: one knowledgeable, and hence with expected test score E(X) = n; the other not knowledgeable, with E(X) = 0, either when the test is free-response or when the scores are corrected for guessing. The formula for this index is

nN ^ . 

where n and N are the numbers of items and persons, respectively, and $X_p$ is the pth person's total score.

An alternative formulation for S is

$$S = \frac{4}{n^2 N} \sum_{p=1}^{N} \left(X_p - \frac{n}{2}\right)^2 .$$

If this is rewritten as

$$S = \frac{\dfrac{1}{N}\displaystyle\sum_{p=1}^{N}\left(X_p - \dfrac{n}{2}\right)^2}{\dfrac{n^2}{4}},$$

it follows that S can be interpreted as the ratio of A to B, where A is the mean square deviation of the $X_p$ from n/2 and B is the maximum possible mean square deviation from n/2 (and hence the maximum possible variance for a test of n items).
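The ratio interpretation can be sketched in a few lines, together with a check of the variance identity $S = 1 - 4(pq - \sigma_X^2/n^2)$ that follows algebraically from the alternative formulation when p is the mean proportion of items correct and the population variance is used. Function names are hypothetical.

```python
from statistics import fmean, pvariance


def separation_index(scores, n_items):
    """Mean square deviation from n/2 divided by its maximum, n^2 / 4."""
    mid = n_items / 2
    msd = sum((x - mid) ** 2 for x in scores) / len(scores)
    return msd / (n_items ** 2 / 4)


def separation_from_variance(scores, n_items):
    """Same index written in terms of p, q and the population variance."""
    p = fmean(scores) / n_items
    q = 1 - p
    return 1 - 4 * (p * q - pvariance(scores) / n_items ** 2)


if __name__ == "__main__":
    sample = [3, 7, 12, 20, 0, 15]  # hypothetical total scores, n = 20
    print(separation_index(sample, 20), separation_from_variance(sample, 20))
```

Both functions should agree up to rounding for any score set, which illustrates that S depends on the score distribution alone and not on a criterion level.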

The index can be analyzed according to the CRT reliability criteria as follows:

1. The index of separation has maximum values insofar as scores depart from n/2 rather than from the cutoff. Thus S is a score distribution index and is not criterion-dependent; this is also clear from Equation 6.

2. The index of separation is algebraically related to total score variance by the formula

$$S = 1 - 4\left(pq - \frac{\sigma_X^2}{n^2}\right),$$

where p is the mean item difficulty (i.e., $\bar{X}/n$) and q = 1 - p. Nonetheless, S is variance-free in the same important respect as coefficient beta is: it can take on its full range of values even though the total score variance is zero. Also like coefficient beta, S = 1 when variance is at its maximum, and S = 0 implies zero variance (when the set of total scores is {n/2}).

3. The index of separation is independent of classical measurement error concepts.

4. The index of separation is not a function of the criterion level; it is a function of the frequency distribution of total scores alone. Its value for a given score distribution is therefore invariant under changes in the criterion level. Thus it is a score distribution index and not a CRT index.

5. The index of separation has range [0,1]. It is 0 when all scores are n/2, and 1 when all scores are 0 or n.

Since the index of separation does not satisfy criteria 1 and 4, it may be helpful to introduce a related index that satisfies these criteria. Such an index, the criterion-referenced index of separation ($S_c$), is formulated as follows:



N 



c ) 



x>c 



[7] 






where $f_X$ is the frequency of score X in the score distribution.

Appendix A demonstrates that $S_c$ = S if C = n/2, and thus $S_c$ is a generalization of S. The criterion-referenced index of separation meets all five CRT index criteria. Thus coefficient beta will be compared with it as well as with the Livingston and Harris coefficients in Chapter VI.

Other Fourfold Table Test Indices

In the analyses reported in Chapter VI, reference is made to two other indices besides those CRT coefficients discussed thus far. In this section, these other indices are described.

First, consider again the definition of the elements of the mastery (+) / nonmastery (-) contingency table:

             +      -

      +      A      B

      -      C      D

and recall that coefficient beta is equal to the mean of all possible split-half coefficients of agreement, where the coefficient of agreement, P, is

$$P = \frac{A + D}{N}.$$

The cosine-pi estimate. A correlation statistic, appropriate when the two (inherently continuous) underlying variables have been artificially dichotomized, is the cosine-pi estimate ($r_{\cos\pi}$) of the tetrachoric correlation coefficient ($r_{tet}$). A computing formula is

$$r_{\cos\pi} = \cos\left(\frac{\pi\sqrt{BC}}{\sqrt{AD} + \sqrt{BC}}\right) \qquad [8]$$

where the angle is expressed in radians and the symbols A, B, C, and D refer to the entries in the contingency table above. This formula yields a good estimate of $r_{tet}$ only when the marginal frequencies of the contingency table do not depart markedly from N/2 (Guilford, 1965).

The phi coefficient. Another index is the phi coefficient ($r_\phi$). Its formula is

$$r_\phi = \frac{AD - BC}{\sqrt{(A+B)(A+C)(B+D)(C+D)}} \qquad [9]$$

where A, B, C, and D are defined as before. The phi coefficient is a special case of the Pearson product-moment correlation that is calculated on two inherently dichotomous variables.

Normally, the computation of these coefficients requires two sets of data (resulting from two administrations of a test). However, in the course of the computer calculation of coefficient beta, a "grand" fourfold table with entries equal to the means of the results of all possible split-half categorizations is easily constructed. It follows from the analysis of the derivation of coefficient beta given in Chapter III, and from Equation 3 in particular, that the entries in the cells of this "grand" fourfold table are:



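$$A^* = \sum_{X=2k}^{\frac{n}{2}+k-1} f_X \sum_{j=k}^{X-k} \frac{\binom{X}{j}\binom{n-X}{\frac{n}{2}-j}}{\binom{n}{\frac{n}{2}}} \;+\; \sum_{X=\frac{n}{2}+k}^{n} f_X$$

$$D^* = \sum_{X=0}^{k-1} f_X \;+\; \sum_{X=k}^{2k-2} f_X \sum_{j=X-(k-1)}^{k-1} \frac{\binom{X}{j}\binom{n-X}{\frac{n}{2}-j}}{\binom{n}{\frac{n}{2}}}$$

$$B^* = C^* = \tfrac{1}{2}\,(N - A^* - D^*)$$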



In this study, the cosine-pi estimate and the phi coefficient are calculated from this "grand" table, and under these conditions they can be construed as single-administration indices. Note, for example, that the $r_\phi$ thus calculated is not equal to the mean of all possible split-half phi coefficients (the computer program was not designed to do the calculations required) but rather is a single coefficient calculated from a table resulting from all possible split-half nonmastery/mastery categorizations.
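The construction of the "grand" table and the two indices computed from it can be sketched as follows. This is an illustration, not the program used in the study: it assumes an even number of items, equally likely equal-size splits (so the count on one half is hypergeometric), and mastery on a half-test when its score is at least k. All function names and the score frequencies in the demonstration are hypothetical.

```python
from math import comb, cos, pi, sqrt


def grand_table(freq, n_items, k):
    """Mean split-half fourfold table (A, B, C, D).

    freq[x] is the number of examinees with total score x, x = 0..n_items.
    """
    half = n_items // 2
    splits = comb(n_items, half)
    a = d = 0.0
    for x, f in enumerate(freq):
        for j in range(max(0, x - half), min(x, half) + 1):
            weight = comb(x, j) * comb(n_items - x, half - j) / splits
            if j >= k and x - j >= k:      # mastery on both halves
                a += f * weight
            elif j < k and x - j < k:      # nonmastery on both halves
                d += f * weight
    e = (sum(freq) - a - d) / 2            # B and C are equal by symmetry
    return a, e, e, d


def cosine_pi(a, b, c, d):
    """Cosine-pi estimate of the tetrachoric correlation (angle in radians)."""
    return cos(pi * sqrt(b * c) / (sqrt(a * d) + sqrt(b * c)))


def phi_coefficient(a, b, c, d):
    """Phi coefficient of a fourfold table."""
    return (a * d - b * c) / sqrt((a + b) * (a + c) * (b + d) * (c + d))


if __name__ == "__main__":
    freq = [0] * 21
    freq[4], freq[14], freq[18] = 5, 4, 5   # hypothetical score frequencies
    table = grand_table(freq, 20, 7)
    print(table, cosine_pi(*table), phi_coefficient(*table))
```

Under these assumptions the sum of cells A and D divided by N reproduces coefficient beta for the same score distribution.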

Coefficient kappa. Millman (1974) and Swaminathan et al. (1974) have proposed that coefficient kappa (Cohen, 1960), an index originally developed for nominal data, rather than the coefficient of agreement, is the appropriate index to use for dual-administration CRT reliability. The computing formula for $\kappa$ is

$$\kappa = \frac{p_o - p_c}{1 - p_c},$$

where, in the case of dichotomous categorization for each administration,

$$p_o = \frac{A + D}{N},$$

the observed proportion of like categorizations (i.e., the coefficient of agreement), and

$$p_c = \left(\frac{A+B}{N}\right)\left(\frac{A+C}{N}\right) + \left(\frac{C+D}{N}\right)\left(\frac{B+D}{N}\right),$$

the "expected" (by chance) proportion of like categorizations (i.e., the sum of products of marginal proportions, as in a chi-square test of association).

The advantage claimed for coefficient kappa is that it removes from the final coefficient that proportion of agreement due to chance, that is, the expected proportions in the population. It seems unclear, however, what interpretation should be given to the notion of population proportion. In the case of an attribute with truly nominal values, say eye color, it makes sense to talk of the proportion of the population with hazel eyes (given, of course, a suitable measurement process for identifying "hazel"). But for such an ephemeral attribute as degree of mastery of a given behavioral objective, where fifteen minutes of instruction may well change a person from the nonmastery category to the mastery category, the "expected proportion" of the population in one category is not so clear.

Since coefficient kappa is a dual-administration index, it is not within the scope of this study. It would be interesting, however, to consider a single-administration coefficient equal to the mean of all possible split-half kappa coefficients. Unfortunately, the algebra involved is forbidding.

However, rather than take the mean of all possible split-half kappa coefficients, one can treat coefficient kappa in the same way as the cosine-pi estimate and the phi coefficient. That is, one can calculate the kappa coefficient from the "grand" fourfold table, which gives the means of all possible split-half categorizations. Recall from the derivation of coefficient beta that cells B and C in the "grand" fourfold table are equal because of the symmetry implied in labeling the halves of the test. To indicate this, let

B = C = E,

and let the superscript (*) denote a coefficient calculated from the "grand" table. Then

$$\kappa^* = \frac{p_o^* - p_c^*}{1 - p_c^*} = \frac{\dfrac{A + D}{N} - \dfrac{(A+E)(A+E) + (D+E)(D+E)}{N^2}}{1 - \dfrac{(A+E)(A+E) + (D+E)(D+E)}{N^2}}.$$

With a little algebra (see Appendix A) this can be simplified to

$$\kappa^* = \frac{AD - E^2}{(A+E)(D+E)}.$$

Note, however, that under these same conditions of B = C = E, the phi coefficient (from Equation 9) is

$$r_\phi^* = \frac{AD - E^2}{(A+E)(D+E)}.$$
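The equality of kappa and phi on a table with B = C can be confirmed numerically. The sketch below uses the general computing formulas for both coefficients and compares them with the simplified expression; the cell values are hypothetical and the function names are assumptions of this illustration.

```python
from math import isclose, sqrt


def kappa(a, b, c, d):
    """Cohen's kappa for a fourfold table."""
    n = a + b + c + d
    p_o = (a + d) / n
    p_c = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_o - p_c) / (1 - p_c)


def phi_coefficient(a, b, c, d):
    """Phi coefficient of a fourfold table."""
    return (a * d - b * c) / sqrt((a + b) * (a + c) * (b + d) * (c + d))


def simplified(a, e, d):
    """Closed form that both coefficients reduce to when B = C = E."""
    return (a * d - e ** 2) / ((a + e) * (d + e))


if __name__ == "__main__":
    a, e, d = 6.5, 1.25, 5.0   # hypothetical "grand" table cells
    print(kappa(a, e, e, d), phi_coefficient(a, e, e, d), simplified(a, e, d))
    assert isclose(kappa(a, e, e, d), phi_coefficient(a, e, e, d))
```

When B and C differ, the two coefficients generally differ as well, so the equality is specific to the symmetric "grand" table.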



%rith the phi ccHSffflcient (rjj . Wcrtfcver if w^^e jIctta^^w luejrjr- rp^it- 
half i:.2p^pa co-cf f i cient r» to be is r T'-5 

hypothesis results from some limited empirical and theoretical research by this author, and is based on a comparison of the two quantities calculated from a few manufactured score distributions and some item-by-person response matrices associated with each distribution. All examples supported this tentative result; however, the difference between the two expressions is usually slight, generally not more than
.OS. For ^/..irfle^ * hrpozhtric^l te^t li'lth ;5i HAit-er/ -rit^r^^n. 

cf sr:i teri e j-aat; nee-s *::.h 3r ict^l icr^T% cec^vr 

FC'ur different itCT-b/-p»u?i i rc:;pcr.se a-atrlc^i /ield valuo c»f. r r:5*;g- 
ing fn^r. . 2?i7 to ,3*62. Tnus, in interpreting the results CJ^cic^mirrg 
the phi coefficient presented in Chapter VI, one shotild bear in aini 
that r* appears to be a fgencrall/ close) IciNfer bfOtxui to <r . 

It is also of interest to note that ! - or 1 - '-^ id^r.t: ?, 

with 1 calculated fro« a fo^jrfold tabU 4*^trj e<jual off 4 La gon-sti ceils^ 
where 1 IS the csriantc of the iihditx of incon:tisterjcy uj'-j for bi-r^coial 
data by the Bure:' i of the Censu^, as reported by Cochrafi (1968, p. 6<r3) . 




2. What are the characteristics of the three other criterion-referenced test indices defined in Chapter IV?

3. Are there predictable relationships between coefficient beta and any or all of these three indices?

4. Are there predictable relationships between coefficient beta and other fourfold contingency table indices?

Large amounts of systematic data are needed to obtain satisfactory answers to these questions. Prohibitive expenditures of resources and inordinate cooperation from schools would be required to collect such data empirically, and hence the data were simulated by computer.




A computer program was designed by the investigator and written by a colleague to generate the data for this study. The purpose and design of the program were threefold: (1) to simulate the results of the test-taking process in the form of response matrices of 0's and 1's; (2) to allow for systematic control of the generation of these matrices by providing great flexibility in the definition of input parameters, to be discussed later; and (3) to create graphic aids and to calculate various statistics, including those used in this study, from each simulated response matrix.

The first step in using the computer program is to define the input parameters, discussed in the next section. Then the program generates a response matrix of 0's and 1's according to the equation

where

$r_{ipt}$ is the response to the ith item by the pth person on the tth trial (or replication) of the test;

$g_i$ is the "goodness" of an item, akin to item-test correlation, with range [0,1];

$c_p$ is the "competence" of the person on the behavioral objective being measured, with range [0,1];

$d_i$ is the "facility" and therefore $1 - d_i$ is the intrinsic difficulty of an item, with range [0,1];

and the e's are normally distributed random components each with an expected value of 0, but whose variance may be specified. The first is a persons-by-trials component: persons feel different from day to day and would react to tests differently as a result. The second is a (generally larger) items-by-persons component: it is not realistic to assume that a given item will have the same difficulty, relative to other items, for each person. The third is a catch-all, undefined component that varies over items, persons, and trials, and may be thought of as related to errors of measurement.

The response is counted as correct or incorrect, and thus the element in the response matrix is 1 or 0 as $r_{ipt} > 0$ or $r_{ipt} < 0$, respectively.

Note that when an item is perfectly "good," $g_i$ = 1, and when the persons-by-trials error component is ignored, Equation 11 reduces to

$$r_{ipt} = c_p - (1 - d_i),$$

implying that the response is recorded as correct when the person is at least as competent as the item is difficult. Further, a perfectly "bad" item would be one with $g_i$ = 0; in this case the basic equation [11] reduces to

$$r_{ipt} = e''_{ip} + e'''_{ipt},$$

implying that the correctness of the response is due completely to random factors. Note further that for a perfectly good item the effect of the item-by-person error vanishes; this effect does appear when the item is not perfectly good. These values of item goodness are limits rather than realities, of course, and thus the values of $g_i$ actually used in the investigation were between these extremes.
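The generation of a response matrix can be sketched under one functional form that is consistent with the two limiting cases just described: $r_{ipt} = g_i\,[c_p + e'_{pt} - (1 - d_i)] + (1 - g_i)\,[e''_{ip} + e'''_{ipt}]$. This form is an assumption of the sketch, not a transcription of Equation 11, and the function name, argument names, and error standard deviations are hypothetical. A response of exactly zero is scored as correct here, following the "at least as competent" wording.

```python
import random


def simulate_responses(goodness, competence, facility,
                       sd_pt=0.0, sd_ip=0.0, sd_ipt=0.0, seed=None):
    """Return a persons-by-items matrix of 0/1 responses for one trial.

    Assumed form (see lead-in): g * direction + (1 - g) * noise, where
    direction = c_p + e'_pt - (1 - d_i) and noise = e''_ip + e'''_ipt.
    """
    rng = random.Random(seed)
    matrix = []
    for c in competence:
        e_pt = rng.gauss(0.0, sd_pt)          # persons-by-trials component
        row = []
        for g, d in zip(goodness, facility):
            direction = c + e_pt - (1.0 - d)  # "initial direction"
            noise = rng.gauss(0.0, sd_ip) + rng.gauss(0.0, sd_ipt)
            r = g * direction + (1.0 - g) * noise
            row.append(1 if r >= 0 else 0)
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    m = simulate_responses(goodness=[0.8] * 5,
                           competence=[0.3, 0.6, 0.9],
                           facility=[0.2, 0.4, 0.5, 0.6, 0.8],
                           sd_pt=0.05, sd_ip=0.2, sd_ipt=0.1, seed=1)
    for row in m:
        print(row, sum(row))
```

With goodness fixed at 1 and all error standard deviations at 0, the sketch reduces to the deterministic rule that a response is correct when competence is at least as large as $1 - d_i$.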

In order to clarify how the basic equation functions, consider the cathode-ray tube analogy shown in Figure 3. Think of the value of $c_p + e'_{pt}$ as an emission point on a cathode, and consider the value of $1 - d_i$ as a hole in a grid. Then $c_p + e'_{pt} - (1 - d_i)$ could be termed "initial direction." The particle is emitted with initial velocity $g_i$ and passes through an electromagnetic field of strength $e''_{ip} + e'''_{ipt}$ toward the anode, $r_{ipt}$. As Equation 11 indicates, the greater the velocity $g_i$, the less the effect of the interference field.

In the example shown in Figure 3, $c_p + e'_{pt}$ = .62 and $1 - d_i$ = .54, resulting in an initial direction of $(c_p + e'_{pt}) - (1 - d_i)$ = .08. If the velocity of the particle were great enough in comparison to the strength of the interference field, the particle would continue on to the upper, greater-than-zero half of the anode and the entry in the response matrix would be 1. In this example, however, the error components are large enough with respect to $g_i$ to bend the path of the particle downward to the less-than-zero half of the anode, and the entry in the matrix is 0.

It should be further noted that since the computer program is designed to simulate the results of the test-taking process rather than the process itself, the relationships in the computer model among such things as item facility, examinee ability, and test mean are not necessarily those one might expect. For example, test mean is not an algebraic function of item facility alone (as it is in the usual test models), but rather is only influenced by it, and then only in conjunction with person competence (combined with it to produce "initial direction") and subject to the effects of both item goodness and the error components.



The differences between the usual model and the computer model
used in this research are due to practical rather than theoretical
considerations: the usual model does not readily lend itself to
computer simulation since its inner relationships are necessarily bound
up with the unpredictability of human behavior. The computer model
was evolved over a period of time as the best procedure that the author
and his computer-programmer colleague could devise in order to simulate
the results of the usual test-taking process.

In Figure 4, the usual relationships (A) and those of the computer
model (B) are compared. Arrows indicate directions of relationships,
solid lines indicate direct relationships, and dotted lines indicate
indirect relationships.



Figure 4. Relationships between item, examinee, and test characteristics:
a comparison of the classical (A) and computer (B) models.



Input Parameters

The computer program offers a wide range of options for defining
the three major vectors (c_p, d_i, and g_i) (see Appendix C for more
detail). However, for the purposes of this study, only a limited variety
of options was used.

The competence vector was restricted to two types. One is a normal
distribution (Figure 5), with μ = .5 and σ² such that all values
generated lie between 0 and 1 (explained more fully in Appendix C).
This competence distribution was chosen to reflect the classical
assumptions about ability within a population.



Figure 5. Histogram of components of a normally distributed
competence vector.



The second type is a bimodal, "inverse normal" distribution
(Figure 6), which is essentially what would be obtained if a normal
distribution were cut in half at the center, the left half translated .5
to the right, and the right half translated .5 to the left. This
competence distribution was chosen to reflect the notion that, for a given
behavioral objective, a student generally either has or has not mastered
the objective.
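The cut-and-translate construction just described can be sketched in a few lines of Python. This is only an illustration, not the program used in the study: the value σ = 1/6, the redrawing of out-of-range values, and the function names are assumptions made here so that every generated value lies between 0 and 1 (the report defers the actual procedure to Appendix C).

```python
import random


def normal_competence(rng, mu=0.5, sigma=1 / 6):
    """Draw one competence value from a normal distribution.

    Values outside [0, 1) are redrawn (an assumption of this sketch).
    """
    while True:
        c = rng.gauss(mu, sigma)
        if 0.0 <= c < 1.0:
            return c


def inverse_normal_competence(rng, mu=0.5, sigma=1 / 6):
    """Cut the normal distribution at .5 and swap the halves.

    The left half is translated .5 to the right and the right half
    .5 to the left, which piles the mass up near 0 and near 1.
    """
    c = normal_competence(rng, mu, sigma)
    return c + 0.5 if c < 0.5 else c - 0.5


rng = random.Random(0)
sample = [inverse_normal_competence(rng) for _ in range(10_000)]
middle = sum(0.4 <= c < 0.6 for c in sample)       # few values near .5
extremes = sum(c < 0.1 or c >= 0.9 for c in sample)  # many near 0 and 1
```

With these assumed settings the sample is sparse around .5 and dense near the two ends of the interval, which is the "has or has not mastered" pattern the bimodal vector is meant to represent.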































Figure 6. Histogram of components of a bimodal
competence vector.



The item facility and goodness vectors employed in the study were
all uniformly distributed, but their upper and lower bounds varied
according to preset conditions.



Eight sets of parameters were used, resulting in eight families
of response matrices, score distributions, and test indices. Particular
combinations of parameters were chosen to simulate responses to
three types of tests.

The first type of test has a moderate number of items, relatively
low item goodness, and a wide range of item facilities. It is perhaps
best exemplified by a poorly written teacher-constructed test. Parameter
sets 1 and 2, which use the normal and bimodal competence vectors,
respectively, are of this type. Examples of the resulting distributions,
together with the accompanying basic test statistics, are given in Figures
7 and 8. These basic test statistics are p, the test mean expressed as
average item difficulty; %V, the variance expressed as a percent of
maximum possible variance for an n-item test; S, the index of separation
(Equation 6); and r, a classical internal consistency reliability estimate.



Figure 7: Score distribution resulting from
parameter set 1 (S = .12).

Figure 8: Score distribution resulting from
parameter set 2 (p = .60, %V = 10, S = .14, r = .55).



The second type of test is short, with relatively high item goodness
and a minimal range of item difficulty. It is perhaps best
exemplified by a well-constructed criterion-referenced test for a narrow,
specific behavioral objective, such as would be found in mathematics.
Parameter sets 3 and 4, which use the normal and bimodal competence
vectors, respectively, are of this type (see Figures 9 and 10).

The third type of test is long, with intermediate ranges of item
facility and goodness, simulating a more traditional standardized test,
such as would be found in a field like science. Parameter sets 5, 6, 7,
and 8 are all of this type. Sets 5 and 6 (see Figures 11 and 12) utilize
the normal and bimodal competence vectors, respectively.



Figure 9: Score distribution resulting from
parameter set 3 (p = .87, %V = 23, S = .78, r = .89).

Figure 10: Score distribution resulting from
parameter set 4 (p = .62, %V = 72, S = .78, r = .97).



Figure 11: Score distribution resulting from parameter set 5
(p = .69, %V = 8, S = .22, r = .72).



Figure 12: Score distribution resulting from parameter set 6
(p = .65, %V = 26, S = .35, r = .94).



For set 7, all parameters are the same as for set 5 except for item
facility: the test is made more difficult, resulting in lower
scores and a lower test mean (see Figure 13).

Figure 13: Score distribution resulting from parameter set 7.



Parameter set 8 is the same as set 6 except that the standard deviations
of the error components are smaller. This set was chosen because the
resulting score distribution (see Figure 14) closely approximates that
of empirical score distributions of tests being developed at the
Wisconsin Research and Development Center for Cognitive Learning, where
this study was conducted.



Figure 14: Score distribution resulting from parameter set 8
(p = .66, %V = 44, S = .54).



Table 4 gives the numerical values for these eight parameter
sets. In all cases, c_p is either normal or bimodal (inverse normal) as
described previously, with μ = .5, [...] = 0, [...] = .2; d_i and g_i are
uniformly distributed within the intervals shown in the table.



iKKrr >MJr>j^zrK7fZ vJ>.r> idyls' 77i« s?7vr:;r 



teet 
t7J>e 


»iet er 

a-et 




— 1« 


! ! 

\ i J . 






sin 


; raj: [ £:i.r; 

j 






I 


1 
2 


N 
B 


c 




r ' 

! 




1 

^ r. 








20 


2 


3 
4 


a 


.7 




r 














3 


> 
6 


N 
B 


.6 




,1.1 


.4 


.01 






200 


40 


7 










.4 


.01 


.r>4 


.02 


200 ^ 

1 


<0 




B 


.6 


.9 


.2 


.4 


.01 


.02 


.01 


200 j «0 



i 






W>j^t *r*j vaJ^m?* vf lit itfifi'^'i .iJrj (Mfft^t III, for ^ 

Ufy^ \U ^JHiitlf^i vf at) it r#tfe<rr comjii^^, to tints wf?r:.th if. 

<^y«ttJvf* Iff irf^Uf^ftJ ^itf/Jfr/t -i^r^ utMA.^ uj rMt conVrih^^tiont to 

for ^^r/frJtit-t^r <lot?$ cr>ef f i /.i^nt beta v«r:' a 

criU'/j/^fi ].«ryt3 'iiinjiiesl' 7» a/is-wtrr this -li^i^^tion, r<^i^poj.i*;^ ^tr. 

t^n^rttf^'i tor of ^i-^ht p^rmttwt r^^r.^, gvaphi. y^re 

7V.t^ j/riif^ht itnd antwtrri to g]i r^oestlons which follow. :» 
yen j/^O.i^j^tfr VI . 

c. Is coefficient beta affected by the number of examinees? To answer
this, four matrices were generated for each parameter set, using 25, 49,
100, and 400 examinees.

d. What is the behavior of coefficient beta as the number of items
increases? Is the Spearman-Brown prophecy formula applicable? To
answer these questions, four matrices were generated for each parameter
set, using 10, 20, 40, and 80 items, graphs were drawn, and various
regression analyses were carried out.

e. Are there predictable relationships between coefficient beta and
the following basic test statistics: (1) test mean, expressed as a
percent (i.e., mean item difficulty), (2) score variance expressed as a
percent of the maximum possible variance for a test of n items,
(3) index of separation, (4) coefficient alpha (KR-20), and (5) KR-21?
To answer these questions, various analyses, including stepwise analysis
of regression, often non-linear, were carried out on the data generated
in answer to Question 1b.

2. What are the characteristics of three other criterion-dependent
single-administration indices? Harris's index of efficiency, Livingston's
criterion-referenced reliability coefficient, and the criterion-referenced
index of separation, all discussed in Chapter IV, were computed for the
same parameter sets as those for which coefficient beta had been
calculated. The analyses were similar to those mentioned under Question 1.

3. Are there predictable relationships between coefficient beta and
any or all of these three indices? This question was answered through
graphs and analyses of regression.

4. Are there predictable relationships between coefficient beta and
other fourfold table indices? The cosine-pi estimate and the phi
coefficient (and hence coefficient kappa with equal off-diagonal cells)
were calculated for the parameter sets from the table resulting from all
possible split-halves. Data were analyzed through graphs and regression
analyses.

The Regression Analyses

The regression analysis routine chosen for the study, STEPREG1 (1973),
is part of the University of Wisconsin computer center's standard
statistical analysis package. The basic purposes of this stepwise
analysis of regression program are to analyze the manner in which (and
degree to which) the variance of the dependent variable is explained by
variation in the independent variables, and to calculate regression
equations. The stepwise feature of this statistical technique allows one
to introduce independent variables into the regression equation in any
number and in any order, either singly or in groups. If some or all of the
variables are allowed to enter as a group, the program determines the
magnitude of the contribution of each of these variables toward
explaining the variance of the dependent variable and allows these
variables to enter the regression equation in order of the magnitude of
their contributions. Thus one can analyze not only which independent
variables help explain the behavior of the dependent variable, but also
which ones are most important. The result can be interpreted as
representing a quantified "sociogram" of the indices in the analysis.

Stepwise analyses of regression were used rather extensively in
this study because the procedure made it possible to analyze the manner
in which coefficient beta and other indices are related to various test
statistics and to each other.



CHAPTER VI
RESULTS AND CONCLUSIONS

This chapter is in several sections, roughly corresponding to the
questions set forth in the previous chapter. The first section deals
with the characteristics of coefficient beta, which was developed in
Chapter III, and its relationships to various test parameters and basic
test statistics (including classical reliability). The following three
sections deal similarly with the three other recently suggested test
indices that were defined and briefly discussed in Chapter IV. The
last section discusses the relationships of these four indices among
themselves and to the cosine-pi estimate and the phi coefficient
defined in Chapter IV.

Characteristics of Coefficient Beta

Values of φ(X)

As mentioned earlier, one approach to the analysis of coefficient
beta is to investigate its component parts. Recall from Equation 2
that

     β = (1/N) Σ φ(X_p),   summed over p = 1, ..., N.

Here N is the number of examinees, X_p is the pth person's total score,
and φ is as defined in Chapter III. Since X_p is a member of the set
{0,1,...,n}, it is useful to inspect the values of φ(X) for each X in
{0,1,...,n}. Table 5 shows these values of φ(X) to two decimal places
for a 20-item test with a criterion level of 0.7.

TABLE 5

Values of φ(X) for n = 20, c = .7

     X          φ(X)

     0 - 6      1.00
     7           .99
     8           .98
     9           .93
     10          .82
     11          .63
     12          .35
     13          .00
     14          .37
     15          .70
     16          .91
     17 - 20    1.00

As can be seen, φ(X) decreases as a person's score nears 13, which is
the integer 2k-1 as defined in Chapter III. In general, the farther
a person's score is from the cutoff, the greater is φ(X).
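The tabled values can be checked with a short computation. The definition of φ from Chapter III is not repeated in this chapter, so the sketch below rests on an assumption: that φ(X) is the proportion of all possible split-halves on which an examinee with total score X receives the same mastery classification on both half-tests, with k the minimum half-test score required for mastery (k = 7 for n = 20 and c = .7). The function name and arguments are illustrative only.

```python
from math import comb


def phi(x, n, k):
    """Proportion of split-halves that classify total score x consistently.

    x: total score on the n-item test
    n: number of items (assumed even, so each half has n // 2 items)
    k: minimum half-test score for a mastery classification
    """
    half = n // 2
    agree = 0
    # h = number of the x correct items that fall in the first half
    for h in range(max(0, x - half), min(x, half) + 1):
        other = x - h  # correct items left in the second half
        if (h >= k) == (other >= k):
            agree += comb(x, h) * comb(n - x, half - h)
    return agree / comb(n, half)


table = {x: round(phi(x, 20, 7), 2) for x in range(21)}
```

Under this assumption the computation reproduces the entries of Table 5 for X = 8 through 16 and the entries of 1.00 at both extremes; for X = 7 it gives .997, which the table lists as .99. A score of 13 = 2k-1 can only be split as 7 or more on one half and 6 or fewer on the other, which is why it contributes zero.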

As noted earlier, Figure 1 gives a graph of these values of X
and φ(X). Other graphs of φ(X), for selected numbers of items and
criterion levels, can be found in Appendix B (Figures B1 through B7).

Coefficient beta and criterion level

As described in Chapter V, the eight different sets of input
parameters selected for the computer program generated eight families of
simulated test score distributions. Since the criterion level is an
integral part of the formula for coefficient beta, the value of the
coefficient will tend to vary as the criterion level changes. Recall that
the formula for coefficient beta (see Equation 5) contains k, the
minimum score required for a mastery classification on a half-test.
Since k of necessity lies in the set {1,2,...,n/2} for an n-item test,
there are n/2 possible criterion levels and hence n/2 meaningful
cutoff scores. Thus, as far as the computation of β is concerned, there
are only half as many meaningful criterion levels, and hence half as
many values of β, as there are items on the test.

In an actual test situation, the mastery criterion level is
unlikely to be less than 0.5, and is perhaps most likely to be in the
range [.6, .9]. Nonetheless, for the sake of thoroughness, the values
of coefficient beta for all possible criterion levels from 2/n to 1 are
shown in Figures 15 through 22 for parameter sets 1 through 8. On each
graph, the abscissa is the criterion level and the ordinate is the value
of β. For reasons to be discussed shortly, a bar graph of the relative
frequency distribution of total scores (see Figures 7 through 14), on
the same scale as the criterion level, is also given along the abscissa
of each graph. Also included with each graph are certain basic test
statistics (defined in the last chapter): p, the test mean; %V, the
percent variance; S, the index of separation; and r, a classical
reliability estimate. For the graphs of β, as well as of the two other
indices computed from the score distribution (to be given later), the
classical reliability estimate is KR-21, since this statistic is computed
from the frequency distribution of total scores, as are the three CRT
indices. For the graphs of k²_TX (to be given later), the classical
reliability estimate is KR-20, or coefficient alpha, since these
statistics are computed from the item-by-person response matrix.
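Three of the basic test statistics named above can be computed directly from a frequency distribution of total scores, as the following sketch shows. Several points are assumptions rather than statements from this report: the variance is taken with divisor N, the maximum possible variance of an n-item test is taken to be n²/4 (half the examinees at 0 and half at n), and KR-21 is written in its usual textbook form. The index of separation S (Equation 6) is omitted because its formula is not given in this chapter.

```python
def basic_statistics(freq):
    """Return (p, %V, KR-21) for a score frequency distribution.

    freq[x] is the number of examinees with total score x, x = 0..n.
    """
    n = len(freq) - 1                      # number of items
    big_n = sum(freq)                      # number of examinees
    mean = sum(x * f for x, f in enumerate(freq)) / big_n
    var = sum(f * (x - mean) ** 2 for x, f in enumerate(freq)) / big_n
    p = mean / n                           # test mean as average item difficulty
    pct_v = 100 * var / (n * n / 4)        # percent of assumed maximum variance
    kr21 = (n / (n - 1)) * (1 - mean * (n - mean) / (n * var))
    return p, pct_v, kr21


# Hypothetical extreme case: half the examinees score 0, half score 4.
p, pct_v, kr21 = basic_statistics([1, 0, 0, 0, 1])
```

In this extreme two-point case the variance reaches the assumed maximum, so %V is 100 and KR-21 reaches 1.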



Figures 15-18: Graphs of coefficient beta against criterion level,
with score distribution relative frequencies,
for parameter sets 1-4.
(Figure 17: p = .87, %V = 23, S = .78, r = .89.)



Figures 19-22: Graphs of coefficient beta against criterion level,
with score distribution relative frequencies,
for parameter sets 5-8.
(Figure 22: p = .66, %V = 44, S = .54, r = .97.)



The graphs show that as c approaches 0, β approaches 1. This
limiting value is reasonable since a criterion level of 0 "separates"
those examinees with a score of 0 or more from those examinees with a
score of less than 0, an impossibility. Hence the "dichotomization" is
perfect, although in a degenerate sense. Also, in general, as c
approaches 1, β again approaches 1. The exceptions seem to be in Figures
17 and 22, both of which show a relatively large number of scores
slightly less than the number of items. If one could set a criterion level
greater than 1, β would take the value 1 at that criterion level since,
like the case c = 0, c > 1 implies "separation" of those examinees with
scores greater than n (another impossibility) from those examinees with
scores less than or equal to n.

Coefficient beta and the score distribution

Coefficient beta does not approach 1 as c approaches 1 in the graphs
of Figures 17 and 22 because of the interaction between β and the
distribution of total scores. Recall that one property deemed desirable
for a CRT reliability index was that such a coefficient should increase
as scores depart from the cutoff. With the exception of Figure 18,
Figures 15 through 22 show that this is indeed the case with coefficient
beta, although these graphs show this relationship in another way: in
these figures, the coefficient increases not as the scores depart from
the cutoff, but as the cutoff departs from the mode(s) of the score
distribution. Perhaps the best examples of this phenomenon are shown in
Figures 19 through 22, where there are more items on the test and thus
smoother curves of β values.

Note, however, that the curve of β values "lags behind" the bar
graph representing the frequency distribution of scores. This lag is
most easily discerned on the graphs with clearly defined score
distribution modes: Figures 15, 19, 21, and 22. In these instances the
criterion level corresponding to a cutoff score immediately above the
mode(s) yields the minimum value(s) of the coefficient. The lag is due to
the fact that the score that contributes zero to β is 2k-1, one less than
the cutoff. This explains why β does not have its minimum value at
the mode in Figure 18, and why it does not drop as sharply as one might
expect at the mode in Figure 17.

At any rate, it is clear that the shape and modes of the score
distribution in relation to the cutoff have an important effect on the
value of β.

Coefficient beta and basic test statistics

The basic test statistics considered in this section are those
given in Figures 15 through 22 and described earlier. They are invariant
for a given item-by-person response matrix; they do not change as the
criterion level varies. The data available for this and later
statistical analyses include values of β at all possible criterion levels
for 24 score distributions: 3 representatives of each of the eight
distribution types. Since β, unlike the basic test statistics, varies as
criterion level varies, it is not meaningful to include all data points
in an analysis comparing β to these basic test statistics. One can,
however, investigate the relationship if the variance in β due to the
changing criterion level is removed. This can be done in one of (at
least) two ways: by taking either the minimum or the mean value of β
over all criterion levels for a given score distribution. Table 6 shows
the rank order of the eight distributions on each basic test statistic,
as well as on min(β) and mean β.

TABLE 6

ORDINAL RANK OF EACH DISTRIBUTION ON THE VARIABLE
INDICATED AT TOP OF COLUMN
1 = LOW; 8 = HIGH

Distribution
from Fig. No.     p     %V     S     KR-21     min β     mean β

     15           3      1     2       1         1          1
     16           2      3     3       2         2          2
     17           8      5     7       5         7          6
     18           4      8     8       7         8          8
     19           7      2     4       3         3          4
     20           5      6     5       6         6          5
     21           1      4     1       4         4          3
     22           6      7     6       8         5          7

From the data in Table 6, Spearman's rank-order correlation (ρ)
was computed for both min(β) and mean β against each of the four basic
test statistics. Table 7 presents these computed values of ρ. The computed
ρ is at least as high for mean β as for min(β) in each case.
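The rank-order correlations can be recomputed from the ranks of Table 6 with the usual formula ρ = 1 - 6Σd²/[n(n² - 1)], where d is the difference between paired ranks. In the sketch below the ranks are entered by hand; three entries that are hard to read in the table (p for Figure 16, S for Figure 19, and KR-21 for Figure 21) are filled in on the assumption that every column is a permutation of 1 through 8. The dictionary keys and function name are illustrative only.

```python
# Ranks of the distributions from Figures 15 through 22, in that order.
ranks = {
    "p":      [3, 2, 8, 4, 7, 5, 1, 6],
    "%V":     [1, 3, 5, 8, 2, 6, 4, 7],
    "S":      [2, 3, 7, 8, 4, 5, 1, 6],
    "KR-21":  [1, 2, 5, 7, 3, 6, 4, 8],
    "min":    [1, 2, 7, 8, 3, 6, 4, 5],   # min(beta)
    "mean":   [1, 2, 6, 8, 4, 5, 3, 7],   # mean beta
}


def spearman_rho(a, b):
    """Spearman's rank-order correlation for two untied rank lists."""
    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


statistics = ["p", "%V", "S", "KR-21"]
rho = {
    (index, stat): round(spearman_rho(ranks[index], ranks[stat]), 2)
    for index in ("min", "mean")
    for stat in statistics
}
```

With these ranks the computation gives .88, .83, and .83 for min(β) against %V, S, and KR-21, and .55, .90, .90, and .93 for mean β against p, %V, S, and KR-21; the value for min(β) against p comes out at .43.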

TABLE 7

VALUES OF SPEARMAN'S RHO (RANK-ORDER CORRELATION)
BETWEEN MIN β, MEAN β, AND BASIC TEST STATISTICS

               p         %V       S      KR-21

min β     [illegible]    .88     .83      .83
mean β        .55        .90     .90      .93

The test mean appears to have little to do with coefficient beta.
The best correspondence seems to be that of KR-21 with mean β. However, it
should be pointed out that other test indices, which are analyzed later,
correspond about equally well with some of the same basic test statistics.

Coefficient beta and the number of examinees

For a given set of test parameters and a given criterion level,
variation in the number of examinees does not seem to have any
systematic effect on the value of β. Figure 23 is a scatterplot of β for 2N
(or, in some cases, 4N) examinees against β_N. The pairs of numbers used*
were (25,49), (49,100), and (100,400). The correlation of β_N and β_2N
(or β_N and β_4N) was high, .94. The obtained linear regression equation
was β_2N = -.001154 + .9995 β_N, which is very close to the model
β_2N = β_N. In fact, the fit is close enough to allow one to assume
without qualm that the model obtains in the population. This result was
expected, since β = (1/N) Σ f_X φ(X), and hence doubling the number of
examinees should merely tend to double each f_X (as well as double N),
resulting in algebraic cancellation.
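The cancellation argument can be illustrated numerically. The sketch below assumes, as a reading consistent with Table 5 rather than a quotation of Chapter III, that φ(X) is the proportion of split-halves on which a total score of X is classified the same way on both halves. The frequency distribution is hypothetical and is not taken from any of the eight parameter sets.

```python
from math import comb


def phi(x, n, k):
    """Assumed form of phi: proportion of consistent split-halves for score x."""
    half = n // 2
    agree = sum(
        comb(x, h) * comb(n - x, half - h)
        for h in range(max(0, x - half), min(x, half) + 1)
        if (h >= k) == (x - h >= k)
    )
    return agree / comb(n, half)


def coefficient_beta(freq, k):
    """beta = (1/N) * sum over X of f_X * phi(X); freq[x] = f_X for x = 0..n."""
    n = len(freq) - 1
    total = sum(freq)
    return sum(f * phi(x, n, k) for x, f in enumerate(freq)) / total


# Hypothetical score frequencies for a 20-item test (indices 0 through 20).
freq = [0, 0, 1, 1, 2, 3, 4, 6, 8, 10, 12, 14, 15, 14, 12, 10, 8, 6, 4, 2, 1]
doubled = [2 * f for f in freq]

beta_n = coefficient_beta(freq, k=7)
beta_2n = coefficient_beta(doubled, k=7)
```

Doubling every frequency doubles both the numerator and N, so the two values agree to within floating-point error; with real samples the frequencies only tend to double, which is why the observed fit is close rather than exact.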



*The number 49 was chosen in place of the perhaps more obvious 50, on the
chance that there was a connection between √N and β. Results showed there
was no such connection.



Coefficient beta and the number of items

For a given set of test parameters and a given criterion level,
variation in the number of items does affect the value of β: in general,
β increases as the number of items increases. Figure 24 is a
scatterplot of β for 2n items against β for n items. (For comparison
purposes, the stars in the figure represent values of coefficient alpha:
α_2n against α_n.) For this figure the pairs of values of n used were
(10,20), (20,40), and (40,80). The scatterplot includes all β values
calculated for all criterion levels on eight score distributions, one
from each of the eight parameter sets. The correlation of this set of
points (considered as a set of ordered pairs) was high.

Figure 24 shows two models. The lower line is the line β_2n = β_n,
i.e., what would be expected if the number of items had no effect on
the value of β (henceforth called the N-E line). The figure shows fairly
clearly that most of the points lie above the N-E line rather than
evenly distributed around it. The regression equation was β_2n = .1999 +
.8151 β_n, consistent with the observation that most of the scatterplot
points lie above the line. In this case the coefficient of
determination (the squared correlation, and thus the proportion of
variance accounted for by the model) was .881; i.e., 88% of the variance
exhibited in the values of β_2n is explained by the model.

The upper curve is β_2n = 2β_n/(1 + β_n), the graph that would
be expected if the Spearman-Brown prophecy formula held (henceforth
called the S-B curve). At first glance this would appear to be a better
model than the lower line. Yet the points are not evenly distributed
around the upper curve: more points are below it than above it. The
regression equation for this model was β_2n = -.5015 + 1.307 [2β_n/(1 + β_n)],
consistent with the observation that more points are below the curve
than above it. The coefficient of determination for this model was
.887, only minimally higher than that for the linear no-effect model.
Hence the Spearman-Brown model does not appear to explain the behavior
of β better than the no-effect model. Nonetheless, using the evidence
presented here, one could claim that the former model does at least as
well as the latter.

It is illuminating to put aside the computer-generated data for
the moment and briefly investigate the behavior of β for some
theoretical score distributions: normal, uniform, and symmetric ("inverse
normal") bimodal distributions. If β_2n for each distribution is plotted
against β_n, there appears to be a pattern. Figures 25, 26, and 27 are
scatterplots for the normal, uniform, and bimodal distributions,
respectively.

Notice that for a normal distribution (Figure 25), the points
(β_n, β_2n) are approximately evenly distributed about the Spearman-Brown
curve β_2n = 2β_n/(1 + β_n), and none falls below the no-effect line
β_2n = β_n. In fact, none seems to fall below an imagined curve halfway
between the S-B curve and the N-E line. Such a half-way curve can be
generated by

     β_2n = (1/2) [2β_n/(1 + β_n) + β_n]                            [12]



In the case of the uniform distribution (Figure 26), all points lie
between the S-B curve and the half-way curve just described. And,
although this figure does not show it, the data from which the figure
was drawn indicate that the points lie at or near the half-way curve
when the criterion level is near .5 and approach the S-B curve when
the criterion level is near 1.



For the bimodal distribution (Figure 27), all points appear to lie
on or between the S-B curve and the N-E line. Therefore, the value of
coefficient beta is apparently affected by the number of items on the
test: the more items, the higher the value of β for a given criterion
level and test type.

The shape of the score distribution seems to have some bearing
on whether the no-effect model (β_2n = β_n) or the Spearman-Brown model
(β_2n = 2β_n/(1 + β_n)) holds: for a sharply bimodal distribution, both
models seem to account for the variance equally well; for a low-variance
normal distribution, the Spearman-Brown model appears to account for the
variance better than does the no-effect model.

Interestingly, the computer-generated data follow very closely the
half-way curve model described previously. An analysis of regression (of
β for 2n items against β_H(n), the half-way curve value for n items)
yielded a coefficient of determination of .884, about the same as for the
earlier two. Unlike the earlier two, the resulting regression equation is
so near to β(2n) = β_H(n) that one is tempted to hypothesize that the
half-way curve is the appropriate model for the population, and that it
should replace both the Spearman-Brown prophecy formula and the no-effect
model as far as β is concerned. (It may also be, of course, that any
appropriate prophecy formula must come from a totally different framework.
This possibility is discussed briefly in Chapter VII.)
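The three candidate models for the effect of doubling test length can be compared side by side. The sketch below simply evaluates the no-effect line, the Spearman-Brown curve, and the half-way curve (taken here to be the midpoint of the other two, as the name suggests) at a few illustrative values of β_n; the function names are introduced for this example only.

```python
def no_effect(beta_n):
    """N-E line: doubling the number of items leaves beta unchanged."""
    return beta_n


def spearman_brown(beta_n):
    """S-B curve: Spearman-Brown prophecy formula for a doubled test."""
    return 2 * beta_n / (1 + beta_n)


def half_way(beta_n):
    """Half-way curve: midpoint between the S-B curve and the N-E line."""
    return 0.5 * (spearman_brown(beta_n) + no_effect(beta_n))


for beta_n in (0.5, 0.6, 0.8, 1.0):
    print(
        f"beta_n = {beta_n:.2f}  N-E = {no_effect(beta_n):.3f}  "
        f"half-way = {half_way(beta_n):.3f}  S-B = {spearman_brown(beta_n):.3f}"
    )
```

For any β_n strictly between 0 and 1 the half-way value lies above the N-E line and below the S-B curve, and all three models agree at β_n = 1, which is why the scatterplot points crowd together near the upper right corner.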



Livingston's k²_TX, unlike coefficient beta, is not additive, and
thus there is no parallel with the φ(X) analysis presented for β. There
are, however, other parallels between the two indices. As these
comparisons are discussed in the last section of this chapter, this
section will be concerned only with the characteristics of k²_TX.

k²_TX and criterion level

The computing formula for k²_TX, which shows its relationship to
other test statistics, was given earlier (Equation 13).
Thus k²_TX has (usually) a different value for each value of C, the cutoff
point. In fact, unlike coefficient beta, the number of possible values of
k²_TX for a given item-by-pupil response matrix is limitless, since C need
not be an integer (Livingston, 1972a). For this investigation, however,
values of C were restricted to the set {.10n, .15n, ..., 1.0n},
where n is the number of test items, the same (where meaningful) as for
coefficient beta.

The graphs in Figures 28 through 35 give the value of k²_TX at the
selected criterion levels, for the representatives of the eight score
distributions. As before, the relative frequency distribution of total
scores, on the same scale as the criterion level, is included with each
graph, along with the basic test statistics.



Figures 28-31: Graphs of $k^2_{TX}$ against criterion level,
with score distribution relative frequencies,
for parameter sets 1-4.

Figures 32-35: Graphs of $k^2_{TX}$ against criterion level,
with score distribution relative frequencies,
for parameter sets 5-8.

The graphs indicate that, as would be expected from Equation 13,
$k^2_{TX}$ has a minimum at the criterion level nearest the test mean
(expressed as a percent), and increases as C departs from the mean. In
earlier research on $k^2_{TX}$ (Marshall, 1973) it was reported that
"when the mean departs from the criterion, the coefficient accelerates
rapidly toward unity," and that "the coefficient generally has values
above .95, and rarely drops below .90 [p. 14]." These statements were
based on score distributions like those represented by Figures 30, 31,
and 35. As Figure 28 shows, however, these statements do not hold for all
kinds of test score distributions, particularly when classical
reliability is low.

$k^2_{TX}$ and the score distribution

Unlike coefficient beta, Livingston's coefficient does not reflect
the modes of the score distribution. Instead, its behavior over changing
criterion levels seems to be a function of only the test mean and the
classical reliability (and thus, indirectly, of score variance). Again,
formula 13 indicates that this must be the case.

$k^2_{TX}$ and basic test statistics

Two relationships, both of which follow directly from formula 13,
hold true for $k^2_{TX}$: the minimum value of $k^2_{TX}$ (if the curve
were made continuous) is the same as KR-20, and this minimum value always
occurs at the test mean. It follows that the rank-order correlation of
the minimum value (over criterion levels) of $k^2_{TX}$ with KR-20 is
unity.
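
The two relationships above can be checked numerically. The following is
a minimal sketch, assuming that formula 13 is Livingston's usual
computing formula, k^2 = [r s^2 + (mean - C)^2] / [s^2 + (mean - C)^2],
with KR-20 supplied as the reliability r. The function name and the
example statistics (a 20-item test with mean 12, variance 9, and KR-20
of .70) are hypothetical and are not taken from the study.

```python
def livingston_k2(reliability, mean, variance, cutoff):
    """Livingston's k^2 from summary statistics (assumed form of formula 13)."""
    shift = (mean - cutoff) ** 2          # squared distance of cutoff from mean
    return (reliability * variance + shift) / (variance + shift)


# Hypothetical test statistics, for illustration only.
N_ITEMS, MEAN, VARIANCE, KR20 = 20, 12.0, 9.0, 0.70

# Cutoffs at .05n, .10n, ..., 1.0n, mirroring the study's criterion levels.
cutoffs = [N_ITEMS * j / 20 for j in range(1, 21)]
k2_values = [livingston_k2(KR20, MEAN, VARIANCE, c) for c in cutoffs]

lowest = min(k2_values)
cutoff_at_lowest = cutoffs[k2_values.index(lowest)]
print(f"minimum k2 = {lowest:.3f} at C = {cutoff_at_lowest}")
```

Under these assumptions the smallest value falls at the cutoff equal to
the mean and equals the reliability that was supplied; every other
cutoff yields a larger coefficient.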



$k^2_{TX}$ and the number of examinees

For a given set of test parameters and a given criterion level,
variation in the number of examinees did not seem to affect the value
of $k^2_{TX}$. This result was expected, since the number of examinees
should not alter the values of the test statistics, such as classical
reliability, to which $k^2_{TX}$ is related. Figure 36 is a scatterplot
of values of $k^2_{TX}$ calculated on 2N (or 4N) examinees against
$k^2_{TX}$ calculated on N examinees. The values of (N, 2N) or (N, 4N)
were the same as for the analysis of coefficient beta; the last pair was
(100, 400).

The linear correlation of these pairs of numbers was very high, .978.
The obtained regression equation was
$k^2_{TX}(2N) = -.1056 + 1.106\,k^2_{TX}(N)$, not too different from the
model $k^2_{TX}(2N) = k^2_{TX}(N)$.
$k^2_{TX}$ and the number of items

Livingston (1972a) has shown that, theoretically, $k^2_{TX}$ adheres to
the Spearman-Brown prophecy formula. This theory is supported by the
results of this study. Figure 37 is a scatterplot of $k^2_{TX}$ for 2n
items plotted against $k^2_{TX}$ for n items, with n = 10, 20, and 40,
and for criterion levels of .6, .7, .8, .9, and 1.0. The upper curve on
the graph is the curve expected from the Spearman-Brown prophecy formula;
the lower line is the line of values to be expected if the number of
items has no effect. Figure 37 shows that the Spearman-Brown prophecy
formula is fairly closely followed. Regression analysis (of $k^2_{TX}$
for 2n items against the stepped-up $k^2_{TX}$ for n items) yielded a
rather high coefficient of determination of .94 and a regression
equation of $k^2_{TX}(2n) = .095 + .90\,k^2_{TX}(n)^{*}$, where the
variable with the superscript is the stepped-up coefficient for n items.
The above regression equation is close enough to the model
$k^2_{TX}(2n) = k^2_{TX}(n)^{*}$ to give moderate empirical support for
Livingston's algebraic derivation. Although linear regression analyses
were not carried out for the no-effect model, Figure 37 suggests that the
Spearman-Brown model produces a much better fit than would a linear
no-effect model.
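
The two reference curves described for Figure 37 can be sketched as
follows. This is an illustration only, assuming the standard
Spearman-Brown formula for a test lengthened by a factor m,
r_m = m r / (1 + (m - 1) r); the function names and the sample
coefficients are hypothetical and are not data from the study.

```python
def spearman_brown(r, factor=2.0):
    """Stepped-up coefficient for a test lengthened by `factor`."""
    return factor * r / (1.0 + (factor - 1.0) * r)


def no_effect(r):
    """Competing model: doubling the number of items changes nothing."""
    return r


# Compare the two predictions for a few illustrative n-item coefficients.
for k2_n in (0.50, 0.60, 0.75, 0.90):
    print(f"k2(n) = {k2_n:.2f}  "
          f"Spearman-Brown k2(2n) = {spearman_brown(k2_n):.3f}  "
          f"no-effect k2(2n) = {no_effect(k2_n):.3f}")
```

For coefficients strictly between 0 and 1 the stepped-up value lies above
the no-effect value, which is why the Spearman-Brown curve is the upper
one on such a plot.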



Characteristics of Harris's $\mu_c^2$

$\mu_c^2$, criterion level, and percent mastery

In the graphs for each parameter set given earlier in this chapter
for $\beta$ and $k^2_{TX}$, criterion level was the independent variable.
Criterion level was not used for the independent variable in the graphs
for $\mu_c^2$, since the results of this study and an earlier one
(Marshall, 1973) showed that $\mu_c^2$ is more clearly a function of
percent mastery than of criterion level. This result follows from an
analysis of the formula for $\mu_c^2$ given earlier as Equation 5:

$$\mu_c^2 = \frac{SS_b}{SS_b + SS_w}$$

where the terms in the ratio represent the between-group and within-group
sums of squares for the groups resulting from the dichotomous
classification of a CRT. If two or more criterion levels yield the same
percent mastery, there is no change in $\mu_c^2$. No matter what the
criterion level, if there is only one classification (i.e., if one of the
groups has no members), $SS_b = 0$ and hence $\mu_c^2 = 0$, provided
there is some score variance within the non-empty group. Hence
$\mu_c^2$ always approaches 0 as the percent mastery approaches 0 or 1.
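
The behavior just described can be reproduced with a small sketch. It
assumes that the index is the between-group sum of squares divided by
the total (between plus within) sum of squares, and that an examinee is
a master when the total score is at or above the cutoff; the function
name, the mastery rule, and the score list are hypothetical.

```python
def harris_mu2(scores, cutoff):
    """Between-group SS over total SS for the mastery/nonmastery split."""
    masters = [x for x in scores if x >= cutoff]      # assumed mastery rule
    nonmasters = [x for x in scores if x < cutoff]
    grand_mean = sum(scores) / len(scores)

    ss_between = 0.0
    ss_within = 0.0
    for group in (masters, nonmasters):
        if not group:
            continue                                   # empty group adds nothing
        group_mean = sum(group) / len(group)
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in group)

    total = ss_between + ss_within
    return ss_between / total if total > 0 else 0.0


scores = [2, 3, 3, 4, 7, 8, 8, 9]                      # hypothetical total scores
for cutoff in (0, 5, 6, 10):
    print(cutoff, round(harris_mu2(scores, cutoff), 4))
```

In this example the cutoffs 5 and 6 sort the examinees identically and so
give the same value, while the cutoffs 0 and 10 leave one group empty and
give 0.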

Thus in Figures 38 through 45, percent mastery rather than criterion
level is the independent variable. As Harris (1972a) points out, there
are as many sortings into groups, and hence values of percent mastery,
as there are test scores with a frequency of one or more in the score
distribution.

Figures 38 through 45 show that the curve for $\mu_c^2$ as a function of
percent mastery is quite smooth and clearly monotonic on either side of
the maximum value of $\mu_c^2$. In fact, it appears that one could
concoct a non-linear algebraic function of percent mastery (perhaps with
some additional variables) that would fit the points precisely. Some
attempts were made during this study to construct such a function.
Although some functions yielded close fits, an exact fit was not
achieved. These findings will shortly be discussed further.

$\mu_c^2$ and the score distribution

There appeared to be no relationship between $\mu_c^2$ and the score
distribution, at least not in the way that the value of $\beta$ reflects
the score distribution mode(s), although the maximum value of $\mu_c^2$
often occurred near the point where there was 50% mastery.



Figures 38-41: Graphs of $\mu_c^2$ against percent mastery,
for parameter sets 1-4.

Figures 42-45: Graphs of $\mu_c^2$ against percent mastery,
for parameter sets 5-8.



$\mu_c^2$ and basic test statistics

The relationship of $\mu_c^2$ to the basic test statistics was
investigated by removing the variance in $\mu_c^2$ due to changing
percent mastery. This can be done by taking either the maximum or the
mean value of $\mu_c^2$ for each parameter set, as in the analysis done
with coefficient beta. (For coefficient beta, min($\beta$) was chosen as
a variable to study because it corresponds to the modes of X and varies
over score distributions, whereas max($\beta$) always approaches 1 as
criterion level approaches its extremes; see Figures 15 through 22. For
$\mu_c^2$, max($\mu_c^2$) was chosen instead because it varies over score
distributions, whereas, except for the truncated distributions shown in
Figures 40 and 41, min($\mu_c^2$) always approaches or reaches 0 as
percent mastery approaches its extremes.) The eight score distribution
types were ranked on max($\mu_c^2$) and mean $\mu_c^2$ and on each of the
basic test statistics, and Spearman's rho (rank-order correlation) was
computed (see Table 8).



TABLE 8

VALUES OF SPEARMAN'S RHO (RANK-ORDER CORRELATION)
BETWEEN max($\mu_c^2$), MEAN $\mu_c^2$, AND BASIC TEST STATISTICS

                     Test      Percent
                     mean      variance      S       KR-21

max($\mu_c^2$)       .48        .86         .79       .88
mean $\mu_c^2$       .29        .95         .80       .93

The results show that test mean had little relation to $\mu_c^2$ (except
as is discussed later), whereas the mean $\mu_c^2$ was very highly
correlated with both percent variance and KR-21. There was also a strong
positive correlation between max($\mu_c^2$) and both KR-21 and percent
variance. That is, the greater the variance (or KR-21 or index of
separation), the greater the maximum and average values of $\mu_c^2$.
These relationships are similar to those between basic test statistics
and $\beta$ or min($\beta$), as reported earlier.

Because of the smoothness of the curves of Figures 38 through 45,
attempts were made to find an algebraic function to describe the
relationship between $\mu_c^2$ and the test statistics. Several
regression equations involving quadratic terms were tried, with the
independent variables of test mean, percent mastery at the test mean,
index of separation, percent mastery which produces the maximum value of
$\mu_c^2$, and both linear and binomial combinations of these. For more
than two-thirds of these models, coefficients of determination were
high, ranging up to .95, but there was not enough consistency among
regression coefficients to warrant any strong generalization. In
summary, visual inspection of the family of curves provided just about
as much information as these non-linear analyses of regression: there is
a non-linear relationship between percent mastery and $\mu_c^2$ (and
other variables), but an algebraic expression of this relationship
remains undiscovered.

In the earlier research cited above (Marshall, 1973), it was stated
that for bimodal distributions $\mu_c^2$ seemed to be very highly
correlated with percent mastery, and was related to test mean and
percent mastery via a bivariate linear regression equation. Figures 43
and 45 help explain the inconsistency between that conclusion and the
conclusion presented here. The earlier research used criterion levels of
.6 and higher only, corresponding roughly to the left halves of these
graphs. It is now evident that the erroneous conclusion of linearity
was reached using such incomplete and unrepresentative data. The earlier
report also asserted that the linear relationship was less strong for
unimodal distributions, such as that represented by Figure 44. The
relationship is clearly non-linear in the left half of that graph.

$\mu_c^2$ and the number of examinees

For a given set of test parameters and a given criterion level,
variation in the number of examinees did not seem to affect the value of
$\mu_c^2$. This was expected since $\mu_c^2$ is the ratio of sums of
squares, and hence increasing the number of examinees should affect both
terms of the ratio equally.

Figure 46 shows a scatterplot of values of $\mu_c^2$ calculated on 2N
(or, as before, 4N) examinees against $\mu_c^2$ calculated on N
examinees. Regression analysis showed the linear correlation of the
pairs of values to be very high, .981. The obtained regression equation
had a slope of .9931 and an intercept near zero (its digits are partly
illegible in this copy), close enough to the model
$\mu_c^2(2N) = \mu_c^2(N)$ to warrant its acceptance as the model that
obtains in the population.



$\mu_c^2$ and the number of items

Harris (1972a) indicates that his index is for "fixed-length mastery
tests," presumably because there is theoretically no interaction between
$\mu_c^2$ and the number of items. Harris's index is unlike the classical
reliability measures and the two criterion-referenced indices discussed
thus far in this regard. Figure 47 shows a scatterplot of $\mu_c^2$ for
2n items against $\mu_c^2$ for n items, with n = 10, 20, and 40, and for
criterion levels of .6, .7, .8, .9, and 1.0.

The linear correlation of this scatterplot was very high, .979. The
obtained regression equation had an intercept of -.0721 (the slope is
illegible in this copy). This appears different enough from the expected
no-effect model of $\mu_c^2(2n) = \mu_c^2(n)$ to suggest that another
model might be more appropriate, but experience with the (simulated)
empirical properties of $\mu_c^2$ indicates another explanation. The
data points were generated at five criterion levels, enumerated above,
rather than for a number of values of percent mastery; yet $\mu_c^2$ is
more closely related to percent mastery than to criterion level.
Depending on the score distribution, the percent mastery can fluctuate
greatly for a given criterion level. For example, in the data discussed
here, a criterion level of .8 produced percent mastery values ranging
from 0 to .81. The model $\mu_c^2(2n) = \mu_c^2(n)$ would more likely be
appropriate if the data had been generated for a set of values of
percent mastery rather than for a set of values of criterion level.



Characteristics of $S_c$

The criterion-referenced index of separation is additive, i.e.,
it is the mean of its component parts. The formula was given earlier
(Equation 7) and is repeated at this point as Equation 14; the displayed
equation is illegible in this copy.

                                                                  [14]

$S_c$ is not a reliability coefficient, but rather is an indicant of how
distant the bulk of the scores are from the cutoff score.

$S_c$ and criterion level

As Equation 14 shows, there are as many values of $S_c$ as there are
values of criterion level. Figures 48 through 55 show the behavior of
$S_c$ for each distribution as the criterion level varies from .05 to 1.
The relative frequency distribution of total scores also appears on each
graph.

The curves of $S_c$ appear quite smooth except for those of Figures
50 and 51, to be discussed shortly. In general, the index takes on
lower values than do the other indices reported herein. There appears
to be no tendency for $S_c$ to approach either 0 or 1 as criterion level
approaches 0 or 1.

Figures 48-51: Graphs of $S_c$ against criterion level,
with score distribution relative frequencies,
for parameter sets 1-4.

Figures 52-55: Graphs of $S_c$ against criterion level,
with score distribution relative frequencies,
for parameter sets 5-8.



$S_c$ and the score distribution

$S_c$ seems to reflect the mode(s) of the score distribution, as does
coefficient beta, but not always in the same way. This is particularly
evident for extremely skewed or J-shaped distributions, such as are
represented by Figures 50 and 51. On those graphs, the value of $S_c$
drops sharply to correspond with the equally sharp mode at X = n.

$S_c$ and basic test statistics

The size of (but not the variance in) the index appears to depend on
the location of the test mean: the farther away the test mean (expressed
as a percent) is from .5, the higher the overall value of the index until
(as in Figures 50 and 51) the criterion corresponds to the mode. This
appears to be the only consistent relationship between $S_c$ and basic
test statistics.



$S_c$ and the number of examinees

For a given set of test parameters and a given criterion level,
variation in the number of examinees did not seem to affect the value of
$S_c$. This is reasonable in light of Equation 14, in which the effects
of increasing the number of examinees should cancel out algebraically.
Figure 56 shows a scatterplot of values of $S_c$ calculated for 2N or 4N
examinees against $S_c$ calculated for N examinees, as was done for the
other indices.

Regression analysis showed the linear correlation of this scatterplot
to be unusually high, .997. The obtained regression equation was
$S_c(2N) = -.006453 + 1.009\,S_c(N)$, quite close to the model
$S_c(2N) = S_c(N)$. Thus $S_c$ is not affected by variation in the
number of examinees.

Figure 56: Scatterplot of $S_c$ for 2N or 4N examinees against $S_c$
for N examinees.
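
One way to make "quite close to the model" concrete is to measure how
far the fitted line departs from the identity line over the range of the
index. The sketch below uses the intercept and slope reported above and
assumes, for illustration only, that the index takes values between 0
and 1; the function name is hypothetical.

```python
def max_gap_from_identity(intercept, slope, low=0.0, high=1.0):
    """Largest |fitted(x) - x| on [low, high].

    The gap is the absolute value of a linear function of x, so its
    maximum over an interval is reached at one of the two endpoints.
    """
    def gap(x):
        return abs(intercept + slope * x - x)

    return max(gap(low), gap(high))


# Reported fit for the number-of-examinees analysis.
worst = max_gap_from_identity(-0.006453, 1.009)
print(f"largest departure from the no-effect model: {worst:.6f}")
```

Under that assumed range, the fitted line never departs from the
no-effect model by as much as .01.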



$S_c$ and the number of items

Figure 57 is a scatterplot of $S_c$ for 2n items plotted against $S_c$
for n items, with n and criterion levels as before.

Figure 57 shows that the points hew to the linear model. Regression
analysis yielded a very high correlation of .997, and a regression
equation of $S_c(2n) = -.01669 + 1.003\,S_c(n)$, very close to the model
$S_c(2n) = S_c(n)$. Hence $S_c$, unlike certain other indices, is
apparently not affected by variation in the number of items.



Relations Among Criterion-Dependent Indices

Two other indices enter into the analysis at this point: the cosine-pi
estimate ($r_{cospi}$) of the tetrachoric correlation coefficient and
the phi coefficient ($r_\phi$). These indices are calculated from the
"grand" fourfold table resulting from all possible split-half
categorizations described near the end of Chapter IV, under which
conditions $r_\phi$ is identical to coefficient kappa. All three indices
were defined and briefly discussed in Chapter IV.

One way to summarize much of the data is to superimpose, for each
parameter set, the individual graphs of the four indices presented
earlier plus two more (but note that $\mu_c^2$ is now plotted against
criterion level rather than percent mastery). Figures 58 through 65 show
values of $\beta$, $k^2_{TX}$, $\mu_c^2$, $S_c$, $r_{cospi}$, and
$r_\phi$, as well as the relative frequency distributions of total
scores, for each of the eight parameter sets, using criterion level as
the independent variable. In many of the graphs, it appears that these
six indices are roughly grouped into three families: $\beta$,
$k^2_{TX}$, and a third index in one; $r_{cospi}$ and another index in a
second; and (with some exceptions) the remaining index by itself. (The
symbols identifying the other members of each family are illegible in
this copy.) More will be said about these apparent interrelationships
later.

Notice that $r_{cospi}$ runs off the lower edge of most graphs at the
extreme criterion levels. This is due to the occurrence of an empty
cell in one of the diagonals of the fourfold table used in computing
$r_{cospi}$ by the formula given earlier as Equation 8. When one of
these diagonal cells is empty, as is often the case at extremely low or
high criterion levels, $r_{cospi}$ is -1, even though the coefficient may
have quite a different value when the cell is nearly empty. For example,
Table 9 shows, for the score distribution corresponding to Figure 64,
the proportions within the four cells and the value of $r_{cospi}$ for
criterion levels of .90 and .95.

Figures 58-65: Indices vs. criterion level; parameter sets 1-8.

TABLE 9

EXTREME FLUCTUATIONS IN $r_{cospi}$

Criterion            Proportion in cell
  level         A         B         C         D        $r_{cospi}$

   .90        .0020     .0145     .0145     .9690        .7097
   .95        .0000     .0047     .0047     .9907      -1.0000



Because of this property of $r_{cospi}$, the analyses that follow might
have been substantially altered if these extreme and unrepresentative
values had been rescored or excluded from the data.
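
The jump shown in Table 9 can be reproduced with a short sketch. It
assumes that Equation 8 (not reproduced in this section) is the usual
cosine-pi approximation, r = cos(pi / (1 + sqrt(AD / BC))), with A and D
taken as the cells on the diagonal that can empty out. The function name
and the handling of empty cells are illustrative choices, not the
author's program.

```python
import math


def r_cospi(a, b, c, d):
    """Cosine-pi estimate of the tetrachoric correlation (assumed form)."""
    ad, bc = a * d, b * c
    if ad == 0 and bc == 0:
        return float("nan")        # undefined when both products vanish
    if ad == 0:
        return -1.0                # cos(pi): an empty A or D cell
    if bc == 0:
        return 1.0                 # cos(0): an empty B or C cell
    return math.cos(math.pi / (1.0 + math.sqrt(ad / bc)))


# Cell proportions from Table 9 (criterion levels .90 and .95).
print(round(r_cospi(0.0020, 0.0145, 0.0145, 0.9690), 3))
print(r_cospi(0.0000, 0.0047, 0.0047, 0.9907))
```

With the rounded proportions printed in Table 9, the first call gives a
value of about .71, close to the tabled .7097, while the second gives
exactly -1 because cell A is empty.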

Coefficient beta and other indices 

Figures 58 through 65 suggest that coefficient beta measures much
the same thing as does Livingston's $k^2_{TX}$, at least for unimodal
distributions. The two indices appear to have similar fluctuations as
the criterion level varies, and they are generally close in value at
each criterion level. The major difference is that $\beta$ is sensitive
to (has minima near) the mode(s) of the distribution, whereas $k^2_{TX}$
is sensitive to (has minimum at) the mean of scores. Where the mean and
mode more or less coincide, as in Figures 59 and 64, the coefficients
are nearly equal in value. For a bimodal distribution such as in Figures
63 or 65, however, the difference between them is clear. Since a true
CRT could well be expected to have a bimodal distribution, this
difference between the two coefficients is important.

Stepwise regression analyses bear out these intuitive arguments
(see Appendix D for tables of data). In the regression model with
$\beta$ as the dependent variable (Table D-1) and with test mean,
percent variance, KR-21, criterion level, and the other indices as the
independent variables, $k^2_{TX}$ was always first to enter the
regression equation (and hence would be closest and most influential in
a "statistical sociogram") for each of the five unimodal distributions,
and accounted for between 71% and 92% of the variance in $\beta$. Also
consistent with the intuitive arguments, the amount of variance
accounted for was 71% and 83% for the two distributions in which the
mean and mode were some distance apart, and was higher for the
distributions in which they more nearly coincided. The regression
coefficient was always positive and with but one exception lay between
.64 and .90. For each bimodal distribution, $k^2_{TX}$ always entered
the regression equation also but was never the first variable to do so,
and it accounted for very little variance in $\beta$.

When all unimodal distributions were taken as a group, $k^2_{TX}$ was
again the first variable in the equation and accounted for all but 6% of
the variance explained by that model; it did not even enter the equation
when all bimodal distributions were taken as a group.

From the above data, which are rather consistent for stepwise
regression analyses, it seems reasonable to conclude the following: for
a unimodal test, $\beta$ and $k^2_{TX}$ measure much the same thing and
result in similar values, but this relationship is weaker when the mean
and the mode are not proximate; for bimodal tests, the two indices are
sensitive to different properties of the score distribution.

Coefficient beta also has a moderately strong relationship with $S_c$.
In the regression analysis discussed above, $S_c$ also always entered
the regression equation. For each unimodal distribution it was always
the second variable to enter; for two of the three bimodal distributions
(Figures 63 and 65), it was the first variable to enter, but accounted
for only 29% and 52% of the variance of $\beta$. Figures 58 through 65
show that the curve of $S_c$ over criterion levels did not generally
fluctuate as much as did the curve of $\beta$, and generally has a lower
value than does $\beta$. Nonetheless, they seem to measure somewhat
similar things.

When $r_{cospi}$ and $r_\phi$ were allowed to enter the regression
equation, the results were not consistent. In one instance (Figure 61),
$r_\phi$ was the first variable to enter and accounted for 95% of the
variance, but this was a unique situation. Likewise, when all bimodal
distributions were taken as a group, both $r_\phi$ and $r_{cospi}$
entered the equation and together accounted for about half of the
explained variance. However, this same pattern did not hold for
individual bimodal distributions.



$k^2_{TX}$ and other indices

When Livingston's $k^2_{TX}$ was the dependent variable in the stepwise
regression analysis (see Table D-2 in Appendix D), the results were less
consistent than when $\beta$ was the dependent variable. For instance,
$\beta$ did not always enter the equation when all variables were
allowed to do so, even for unimodal distributions. However, it was the
first variable to enter for four of the five unimodal distributions when
the independent variables were restricted to the criterion-dependent
test indices. Also, as in the analysis of $\beta$, when all unimodal
distributions were taken as a group, $\beta$ was the first to enter and
accounted for 83% of the variance of $k^2_{TX}$ no matter which variables
were allowed to enter the equation. A similar result occurred when all
distributions were taken as a group. When all bimodal distributions were
taken as a group, $\beta$ did not enter the regression equation. Thus it
is clear that $k^2_{TX}$ measures much the same thing as does $\beta$,
particularly for unimodal distributions.

For most of the distributions, $\mu_c^2$ also entered the regression
equation, but the regression coefficients and the amount of variance
accounted for were inconsistent. For the three distributions for which
$\mu_c^2$ was the first to enter the equation (Figures 61, 62 and 65),
between 69% and 92% of the variance in $k^2_{TX}$ was accounted for by
$\mu_c^2$ and the regression coefficients were all negative. Also, when
all bimodal distributions were taken as a group, and all
criterion-dependent indices were allowed to enter the equation,
$\mu_c^2$ (with a negative regression coefficient) accounted for 25% of
the variance in $k^2_{TX}$. Nonetheless, there does not seem to be
sufficient evidence to generalize.



$\mu_c^2$ and other indices

In the stepwise analysis of regression with $\mu_c^2$ as the dependent
variable and the other criterion-dependent test indices as the
independent variables (Table D-3), $r_\phi$ was the first to enter the
equation for three of the distributions (Figure 60, Figure 65, and a
third whose number is illegible in this copy). For the other five
distributions, either $\beta$ or $k^2_{TX}$ was the first variable to
enter, and the regression coefficients were always negative. This is an
indication that $\mu_c^2$ measures something opposite to what $\beta$
(or $k^2_{TX}$) measures. For each distribution, $r_\phi$ was always
either the first or second variable to enter the equation, and the
regression coefficient was always positive.

When unimodal, bimodal, and all distributions were taken as groups,
$r_\phi$ was also the first entering variable, accounting for 61%, 94%,
and 79% of the variance, respectively. Hence it seems clear that,
particularly for bimodal distributions, $\mu_c^2$ and $r_\phi$ measure
similar things.

$S_c$ and other indices

When all variables (basic test statistics and parameters, criterion
level, percent mastery, and the criterion-dependent test indices) were
the free variables in the analysis, the results for $S_c$ were not
consistent. However, when this set was restricted to the
criterion-dependent test indices (see Table D-4), coefficient beta was
the predominant variable for all but two distributions (Figures 59 and
61), suggesting that $S_c$ is in some way associated with $\beta$ (and
therefore with $k^2_{TX}$). However, the percent of variance in $S_c$
accounted for by $\beta$ was not always high. Moreover, when unimodal,
bimodal, and all distributions were taken as groups, the results were
inconclusive.



CHAPTER VII 

SUMMARY AND SUGGESTIONS FOR FUTURE RESEARCH

Summary

In Chapter I it was stated that an increased acceptance of the
interrelated notions of behavioral objectives, individualized
instruction, and mastery learning has given rise to new kinds of
educational tests. One of these new kinds of tests has as its purpose
the efficient separation of the sample of examinees into two groups,
often labeled "nonmastery" and "mastery." When an examinee has only two
courses of action available after taking this kind of test (stay in the
instructional module covered by the test or go on to studying the next
module), his "score" need only be reported in terms of this dichotomy.
Further subdivision of the test score scale serves no purpose; the
dichotomy is sufficient to allow a decision leading to action to be
made. A test of this type, which uses several items drawn from a
well-defined universe to measure a single, narrow behavioral objective,
and whose results yield a dichotomous categorization with reference to a
predetermined criterion level, has herein been called a
criterion-referenced test (CRT).

In Chapter II, some of the psychometric implications of the
differences between a CRT and the more familiar norm-referenced test
(NRT) were given. It was shown that the purpose, desired score
distributions, test specifications, construction, and use in
decision-making of CRTs are not generally the same as for NRTs. It was
also shown that the classical and generally accepted mathematical model
and assumptions that underlie the definitions of traditional measurement
error and NRT test reliability do not apply to the dichotomous
decision-making facet of a CRT. Thus a new, dual mathematical true-score
model for CRTs was proposed: a CRT has both a positional facet,
concerned with the primal measuring process and consistent with the
classical assumptions and the continuous true-score model of an NRT, and
an operational facet, concerned with the dichotomous decision-making
process and consistent with a Platonic (dichotomous) true-score model
but not with the classical model. It was further argued that the
meanings of reliability should be different for the two facets of a CRT.
Whereas an NRT (or the positional facet of a CRT) is reliable insofar as
an examinee receives the same score on two parallel sets of data, the
operational facet of a CRT demands that the test must also be reliable
insofar as the examinee receives the same dichotomous categorization
from the two sets of data. But since a classical reliability estimate is
inappropriate for this second facet of a CRT, what should take its
place?

In Chapter III, an answer to this question is offered. An appropriate
CRT reliability index ought to be founded on the notion of consistent
categorizations. A single-administration coefficient that reflects
this notion is the mean of all possible split-half coefficients of
agreement, where the coefficient of agreement is the proportion of
consistent categorizations, i.e., the proportion of entries in the main
diagonal of a fourfold mastery/nonmastery contingency table. Such an
index, labeled coefficient beta ($\beta$) because of the mean split-half
analogy with Cronbach's alpha, was derived, and theoretical and
computational formulas were given. The computational adjustments
required when the test has an odd number of items were noted. Certain
technical characteristics of coefficient beta were mentioned, and
$\beta$ was shown to satisfy a list of CRT index criteria that were
proposed in Chapter II. Finally, coefficient beta was extended to
trichotomous data, and a formula for the modified coefficient was given.
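
The verbal definition above can be turned into a brute-force sketch. It
is not the computational formula of Chapter III: it simply enumerates
every split of an even-length test into two halves, classifies each
examinee on each half, and averages the proportions of consistent
classifications. The mastery rule (half score at or above the criterion
proportion of the half length), the function names, and the tiny
response matrix are assumptions made for illustration, and the
adjustment for an odd number of items is not attempted.

```python
from itertools import combinations


def agreement(responses, half_a, half_b, criterion):
    """Proportion of examinees classified the same way on both halves."""
    consistent = 0
    for row in responses:
        master_a = sum(row[i] for i in half_a) >= criterion * len(half_a)
        master_b = sum(row[i] for i in half_b) >= criterion * len(half_b)
        consistent += (master_a == master_b)
    return consistent / len(responses)


def mean_split_half_agreement(responses, criterion):
    """Mean coefficient of agreement over all distinct split halves."""
    n_items = len(responses[0])
    if n_items % 2:
        raise ValueError("this sketch handles an even number of items only")
    items = range(n_items)
    coefficients = []
    for half_a in combinations(items, n_items // 2):
        if 0 not in half_a:
            continue               # keep item 0 in half A: each split once
        half_b = tuple(i for i in items if i not in half_a)
        coefficients.append(agreement(responses, half_a, half_b, criterion))
    return sum(coefficients) / len(coefficients)


# Hypothetical item-by-pupil matrix: 4 examinees by 4 items (1 = correct).
responses = [
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
]
beta = mean_split_half_agreement(responses, criterion=0.75)
print(round(beta, 4))
```

Enumeration grows combinatorially with test length, which is presumably
why a computational formula is preferable in practice; the sketch is
meant only to show what is being averaged.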

In Chapter IV, three other recent criterion-dependent test indices
were defined: $k^2_{TX}$ (Livingston, 1972a), $\mu_c^2$ (Harris, 1972a),
and $S_c$ (introduced in the chapter), and their rationales were briefly
discussed. Each index was tested against the CRT reliability index
criteria proposed earlier. In addition, the cosine-pi estimate of the
tetrachoric correlation coefficient and the phi coefficient were
defined, and it was shown that either coefficient can be construed as a
single-administration index if it is calculated from a fourfold table
whose cells contain numbers resulting from all possible split-half
mastery categorizations. It was shown that, under these conditions, the
phi coefficient and Cohen's kappa coefficient are identical.
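
The identity of phi and kappa can be illustrated numerically. The sketch
below assumes that the relevant condition is a symmetric fourfold table
(equal off-diagonal cells), as happens when every split is counted in
both orders; that reading, the cell labels, the function names, and the
example counts are assumptions for illustration rather than the
derivation given in Chapter IV.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for a fourfold table with agreement cells a, d and others b, c."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


def cohen_kappa(a, b, c, d):
    """Cohen's kappa for the same table (counts or proportions)."""
    total = a + b + c + d
    observed = (a + d) / total
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / total ** 2
    return (observed - expected) / (1.0 - expected)


symmetric = (40, 10, 10, 40)       # hypothetical counts with b == c
asymmetric = (40, 5, 15, 40)       # hypothetical counts with b != c

print(phi_coefficient(*symmetric), cohen_kappa(*symmetric))
print(phi_coefficient(*asymmetric), cohen_kappa(*asymmetric))
```

For the symmetric table the two coefficients coincide, while for the
asymmetric table they differ slightly, which suggests that the symmetry
of the "grand" table is what drives the identity.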

In Chapter V the questions investigated in the study were posed and
the analytical methodology used to seek answers to them was discussed.
The questions dealt with certain aspects of coefficient beta and the
three other criterion-dependent indices: their characteristics, their
interrelationships, their relationships to basic test statistics, and
their behavior as criterion level changes and as the number of examinees
and the number of items increases (and in the latter case, the degree to
which the Spearman-Brown prophecy formula applies). The only feasible
way to carry out this kind of study is with simulated data, and hence
the computer program that generated the data for this study was
described in this chapter. Included in this discussion were the equation
used by the program to generate item-by-pupil response matrices, the
available input parameters and output options, and the eight input
parameter sets (and hence kinds of score distributions) that were
selected for this study. The parameter sets were chosen to simulate
three types of tests, discussed in the chapter.

In Chapter VI the results of the data generation were given in
graphs and the data were analyzed through stepwise analyses of
regression, both linear and non-linear. Characteristics of each of the
four criterion-dependent test indices were given. For example, for all
the score distribution types studied, consistently moderate to high
correlations existed between the mean (over criterion level) of each of
three of these indices and classical reliability (and in the case of
$\mu_c^2$, percent of maximum variance). None of the four
criterion-dependent indices was affected by the number of examinees,
which is reassuring. However, the indices varied in the degree to which
they were affected by changes in the number of items. The
criterion-referenced index of separation, $S_c$, and Harris's index of
efficiency, $\mu_c^2$, were not affected by the number of items, but
$\beta$ and $k^2_{TX}$ were. The Spearman-Brown prophecy formula
explained the behavior of $k^2_{TX}$, but the behavior of $\beta$ was
explained equally well by the Spearman-Brown prophecy model and the
(linear) no-effect model. The empirical evidence showed that the
variation in $\beta$ as the number of items increases was best explained
by a model that is an algebraic compromise between the Spearman-Brown
and the no-effect models.

^ • ■ ' . • ^ 

Other relationships were revealed. Perhaps most important and
clear-cut among them was that for unimodal score distributions, coefficient
beta seems to measure much the same thing as Livingston's $k^2$:
their fluctuations over criterion level and their ranges of values were
generally quite similar. For bimodal distributions this relationship
does not hold. The reason is that $\beta$ is sensitive to (has minima
near) the mode(s) of the score distribution, consistent with the proposal
that a CRT reliability index should have higher values as the bulk of
scores depart from the cutoff score, whereas $k^2$ is sensitive to (has a
minimum at) the test mean.

There were moderately consistent correlations (over score distribution
types) between $\beta$ and $S_c$, between $k^2$ and $S_c$, and between
$\mu_c^2$ and $r_\phi$. Put differently, coefficients $\beta$, $k^2$, and $S_c$
seem to measure similar test result attributes, as do $\mu_c^2$ and $r_\phi$
(and therefore $\kappa$). However, there is a basic difference between the
first group ($\beta$, $k^2$, and $S_c$) and the second group ($\mu_c^2$ and
$r_\phi$): the indices in the former group tend to have higher values (1, in
the case of $\beta$ and $k^2$) at the extremes of criterion level, whereas the
latter group tend toward 0 at these same extremes.



To choose a "best" reliability coefficient for the operational facet
of a CRT, one must take into account its premises, rationale, and
characteristics. Of coefficients $\beta$, $k^2$, $\mu_c^2$, and $S_c$, only
$\beta$ is sensitive to the test mode(s) as distinct from the mean. Thus if it
is desired that a CRT operational reliability index have higher values as
scores depart from the cutoff, coefficient beta is the reliability index that
should be used.



SUGGESTIONS FOR FURTHER RESEARCH 



The following research suggestions are based on the results of
this study:

1. Coefficient beta increases as the number of items increases, and
it is the mean coefficient of agreement calculated on all possible halves
of a test. These two facts may suggest that $\beta$ is really a half-test index,
and that its value should somehow be stepped up if it is to be applied to a
whole test.

At least three basic approaches could be made to the stepping-up
procedure. One approach would be to provide a formula that produces a
whole-test coefficient as a function of the half-test coefficient, similar
to the Spearman-Brown prophecy formula or to Equation 12 in Chapter VI.
Another approach would be to calculate coefficient beta on a test of
twice as many items as are ultimately intended to be used and then drop,
selectively or randomly, half the items. A third approach would be to
estimate, based on the obtained score distribution, what the score
distribution would be on a test twice as long, and then calculate $\beta$ from
the score distribution so estimated. This last approach seems to hold
promise, and further research results using either a regression, a
Bayesian, or a binomial model to estimate the double-length score
distribution could prove fruitful. (See Appendix E for binomial model approach.)






2. In Chapter II it was argued that operational reliability of a CRT
must be concerned with accuracy of placement into categories, and that
one useful definition of such reliability would be the proportion of
classifications which are correct classifications (see Table 3). It
was further suggested that a meaningful CRT reliability coefficient would
be a statistic which estimates or is a lower bound to this proportion.

Although it is intuitively reasonable to suppose that coefficient
beta is related to this proportion of classifications that are correct
classifications, such a conclusion has not yet been proved mathematically
and affords a topic for future research.

3. Coefficient alpha is equal to the mean split-half classical
reliability coefficient. Coefficient beta is equal to the mean split-half
coefficient of agreement. For a given total score distribution, $\alpha$ takes
on different values for different item-by-examinee response matrices,
and $\beta$ takes on different values for different criterion levels.
Preliminary research indicates that, for a given response matrix, the mean
value of $\beta$ (over criterion level) is often close to the computed
coefficient alpha. It may be that, for a given distribution of total scores,
there is some relation (upper or lower bound? algebraic function?
equality?) between the mean value of $\alpha$ (over response matrices) and the
mean value of $\beta$ (over criterion levels). This possibility would be
interesting to investigate.

4. It was pointed out at the end of Chapter IV (see also Appendix A)
that when the off-diagonal cells in the fourfold table are equal, the phi
coefficient ($r_\phi^*$) and coefficient kappa ($\kappa^*$) are equal. It was
then hypothesized, based on a small sample of score distributions, that
$\kappa^*$ (and thus $r_\phi^*$) is a generally close lower bound to
$\bar{\kappa}$, the mean split-half kappa coefficient. If this conjecture can
be proved, one could use $r_\phi^*$ (Equation 10) to obtain a close lower
bound to $\bar{\kappa}$.

5. At the end of Chapter III, coefficient beta was extended to
incorporate trichotomous data. It may be that the coefficient can be further
extended to incorporate data utilizing four classifications, or possibly
generalized to any number of classifications. Extrapolation from an
analysis of the formulas for $\beta$ and its trichotomous extension suggests,
however, that for an n-item test, the maximum number of classifications is
$n/2 + 1$.
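
Suggestions 1 and 3 can be explored numerically on small response matrices. The following is a minimal Python sketch, not the program used in this study. It computes coefficient alpha, a brute-force mean split-half coefficient of agreement for a test with an even number of items, and the Spearman-Brown step-up. All function names are hypothetical. The rule used for the half-test cutoff (the smallest integer at or above the criterion level times the half-test length) is an assumption borrowed from the definition of C in Appendix E.

```python
from itertools import combinations
from math import ceil
from statistics import pvariance


def coefficient_alpha(matrix):
    """Coefficient alpha for a persons-by-items matrix of 0/1 responses."""
    n_items = len(matrix[0])
    item_vars = [pvariance([row[i] for row in matrix]) for i in range(n_items)]
    total_var = pvariance([sum(row) for row in matrix])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


def mean_split_half_agreement(matrix, criterion_level):
    """Mean proportion of agreement over all split halves (even n only).

    Assumption: a person is a 'master' on a half-test when the half-test
    score reaches the smallest integer >= criterion_level * (n / 2).
    """
    n_items = len(matrix[0])
    half = n_items // 2
    cutoff = ceil(criterion_level * half - 1e-9)  # guard against float noise
    agreements = []
    for first in combinations(range(n_items), half):
        second = [i for i in range(n_items) if i not in first]
        agree = 0
        for row in matrix:
            master_a = sum(row[i] for i in first) >= cutoff
            master_b = sum(row[i] for i in second) >= cutoff
            agree += master_a == master_b
        agreements.append(agree / len(matrix))
    # Each split is visited twice (once per ordering); the mean is unaffected.
    return sum(agreements) / len(agreements)


def spearman_brown(r_half, factor=2):
    """Spearman-Brown prophecy formula for a test lengthened by `factor`."""
    return factor * r_half / (1 + (factor - 1) * r_half)
```

Averaging `mean_split_half_agreement` over several criterion levels and comparing the result with `coefficient_alpha` for the same matrix is one way to probe the relation raised in suggestion 3. Full enumeration of split halves is practical only for short tests.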






REFERENCES

American Association for the Advancement of Science Commission on Science
Education. The psychological bases of science - a process approach.
Washington, D. C.: American Association for the Advancement of
Science, 1965.

Baker, F. B. Origins of the item parameters X50 and $\beta$ as a modern
analysis technique. Journal of Educational Measurement, 1965, 2, 167-180.

Berger, R. J. A measure of reliability for criterion-referenced tests.
Paper presented at the annual meeting of the National Council on
Measurement in Education, Minneapolis, March, 1970.

Blatchford, C. H. Experimental steps to ascertain reliability of diagnostic
tests in English as a second language. Unpublished doctoral
dissertation, Columbia University, 1970.

Bloom, B. S. Learning for mastery. Evaluation Comment, 1968, 1 (No. 2).

Brennan, R. L. The evaluation of mastery test items. Final Report,
Project no. 2B118, National Center for Educational Research and
Development, U.S. Department of Health, Education and Welfare,
Washington, D. C., 1974.

Brennan, R. L. & Stolurow, L. M. An elementary decision process for the
formative evaluation of an instructional system. Paper presented
at the annual meeting of the American Educational Research Association,
New York, February, 1971.

Carroll, J. B. A model of school learning. Teachers College Record, 1963,
64, 723-733.

Carver, R. P. Special problems in measuring change with psychometric
devices. In B. Baxter (Ed.), Evaluative Research: Strategies and
Methods. Pittsburgh: American Institutes for Research, 1970.

Cochran, W. G. Errors of measurement in statistics. Technometrics,
1968, 10, 637-666.

Cohen, J. A coefficient of agreement for nominal scales. Educational
and Psychological Measurement, 1960, 20, 213-220.

Cox, R. C. & Vargas, J. S. A comparison of item selection techniques
for norm-referenced and criterion-referenced tests. Paper presented
at the annual meeting of the National Council on Measurement in
Education, Chicago, February, 1966.






Cronbach, L. J. Coefficient alpha and the internal structure of tests.
Psychometrika, 1951, 16, 297-334.

Cronbach, L. J. & Gleser, G. C. Psychological tests and personnel
decisions. (2nd ed.) Urbana: University of Illinois Press, 1965.

Darlington, R. B. & Bishop, C. H. Increasing test validity by considering
interitem correlations. Journal of Applied Psychology, 1966,
50, 322-330.

Davis, F. B. Item analysis in relation to educational and psychological
testing. Psychological Bulletin, 1952, 49, 97-121.

Developing Mathematical Processes Staff. Resource Manual, Topics 1-40,
for Developing Mathematical Processes. Chicago: Rand-McNally,

Donlon, T. F. Some needs for clearer terminology in criterion-referenced
testing. Paper presented at the annual meeting of the American
Educational Research Association, Chicago, April, 1974.

Evans, J. Behavioral objectives are no damn good. In Technology and
innovation in education (prepared by the Aerospace Education
Foundation). New York: Praeger, 1968.

Flanagan, J. C. A proposed procedure for increasing the efficiency of
objective tests. Journal of Educational Psychology, 1937, 28, 17-21.

Gagne, R. M. The conditions of learning. New York: Holt, Rinehart and
Winston, 1965.

Gessel, J. Prescriptive mathematics inventory. Monterey, Cal.: CTB/
McGraw-Hill, 1972.

Glaser, R. Instructional technology and the measurement of learning
outcomes: Some questions. American Psychologist, 1963, 18, 519-521.

Glaser, R. & Cox, R. C. Criterion-referenced testing for the measurement
of educational outcomes. In Weisberger, R. A. (Ed.), Instructional
process and media innovation. Chicago: Rand-McNally, 1968.

Goodman, L. A. & Kruskal, W. H. Measures of association for cross
classifications. Journal of the American Statistical Association, 1954,
49, 733-764.

Goodman, L. A. & Kruskal, W. H. Measures of association for cross
classifications: II. Further discussions and references. Journal of the
American Statistical Association, 1959, 54, 123-163.

Guilford, J. P. Fundamental statistics in psychology and education. (4th
ed.) New York: McGraw-Hill, 1965.






Hambleton, R. K. & Novick, M. R. Toward an integration of theory and
method for criterion-referenced tests. Journal of Educational
Measurement, 1973, 10, 159-170.

Harris, C. W. An index of efficiency for fixed-length mastery tests.
Paper presented at the annual meeting of the American Educational
Research Association, Chicago, April, 1972. (a)

Harris, C. W. An interpretation of Livingston's reliability coefficient
for criterion-referenced tests. Journal of Educational Measurement,
1972, 9, 27-29. (b)

Harris, M. L. & Stewart, D. M. Application of classical strategies to
criterion-referenced test construction. Paper presented at the
annual meeting of the American Educational Research Association,
New York, February, 1971.

Hoyt, C. J. Test reliability estimated by analysis of variance.
Psychometrika, 1941, 6, 153-160.

Hsu, T. C. Empirical data on criterion-referenced tests. Paper presented
at the annual meeting of the American Educational Research Association,
New York, February, 1971.

Ivens, S. H. An investigation of item analysis, reliability and
validity in relation to criterion-referenced tests. Unpublished
doctoral dissertation, Florida State University, 1970.

Klausmeier, H. J., Quilling, M. R., Sorenson, J. S., Way, R. S. & Glasrud,
G. R. Individually guided education and the multi-unit elementary
school: Guidelines for implementation. Madison: Wisconsin Research
and Development Center for Cognitive Learning, 1971.

Klein, D. F. & Cleary, T. A. Platonic true scores and error in psychiatric
rating scales. Psychological Bulletin, 1967, 68, 77-80.

Klein, S. P. & Kosecoff, J. Issues and procedures in the development of
criterion-referenced tests. ERIC/TM Report 26. Princeton: ERIC
Clearinghouse on Tests, Measurement, and Evaluation, 1973.

Kosecoff, J. B. & Klein, S. P. Instructional sensitivity statistics
appropriate for objective-based test items. CSE Report No. 91.
Los Angeles: University of California at Los Angeles, Center for the
Study of Evaluation, 1974.

Kuder, G. F. & Richardson, M. W. The theory of the estimation of test
reliability. Psychometrika, 1937, 2, 151-160.






Livingston, S. A. A criterion-referenced application of classical
test theory. Journal of Educational Measurement, 1972, 9, 13-26. (a)

Livingston, S. A. A reply to Harris' "An interpretation of Livingston's
reliability coefficient for criterion-referenced tests." Journal of
Educational Measurement, 1972, 9, 31. (b)

Livingston, S. A. Reply to Shavelson, Block and Ravitch's "Criterion-
referenced testing: Comments on reliability." Journal of Educational
Measurement, 1972, 9, 139-140. (c)

Lord, F. M. & Novick, M. R. Statistical theories of mental test scores.
Reading, Mass.: Addison-Wesley, 1968.

Marshall, J. L. Reliability indices for criterion-referenced tests:
A study based on simulated data. Paper presented at the annual
meeting of the National Council on Measurement in Education, New
Orleans, February, 1973.

Marshall, J. L. & Haertel, E. H. A single-administration reliability
index for criterion-referenced tests: The mean split-half coefficient
of agreement. Paper presented at the annual meeting of the American
Educational Research Association, Washington, D.C., March-April, 1975.

Millman, J. Criterion-referenced measurement. In W. J. Popham (Ed.),
Evaluation in education: Current applications. Berkeley:
McCutchan, 1974.

Nitko, A. J. A model for criterion-referenced tests based on use. Paper
presented at the annual meeting of the American Educational Research
Association, New York, February 1971.

Novick, M. R. & Lewis, C. Coefficient alpha and the reliability of
composite measurements. Psychometrika, 1967, 32, 1-13.

Otto, W. & Askov, E. Rationale and guidelines for the Wisconsin Design
for Reading Skill Development (3rd ed.) Minneapolis: National
Computer Systems, 1974.

Ozenne, D. G. Toward an evaluative methodology for criterion-referenced
measures: Test sensitivity. CSE Report No. 72. Los Angeles:
University of California at Los Angeles, Center for the Study of
Evaluation, 1971.

Popham, W. J. Indices of adequacy for criterion-referenced test items.
In W. J. Popham (Ed.), Criterion-referenced measurement. Englewood
Cliffs, N.J.: Educational Technology Publications, 1971.






Popham, W. J. & Husek, T. R. Implications of criterion-referenced
measurement. Journal of Educational Measurement, 1969, 6, 1-9.

Rim, E-D. Livingston's reliability coefficient and Harris' index of
efficiency: An empirical study of the two reliability coefficients
for criterion-referenced tests. Paper presented at the annual
meetings of the American Educational Research Association and the
National Council on Measurement in Education, Chicago, April, 1974.

Raju, N. S. A note on Livingston's reliability for criterion-referenced
tests. Paper presented at the annual meeting of the National Council
on Measurement in Education, New Orleans, February, 1973.

Roudabush, G. E. Models for a beginning theory of criterion-referenced
tests. Paper presented at the annual meeting of the National Council
on Measurement in Education, Chicago, April, 1974.

Rulon, P. J. A simplified procedure for determining the reliability
of a test by split-halves. Harvard Educational Review, 1939, 9, 99-103.

Shavelson, R., Block, J., & Ravitch, M. Criterion-referenced testing:
Comments on reliability. Journal of Educational Measurement, 1972,
9, 135-137.

Simon, G. B. Comments on "Implications of criterion-referenced
measurement." Journal of Educational Measurement, 1969, 6, 259-260.

Stanley, J. C. Reliability. In R. L. Thorndike (Ed.), Educational
Measurement. Washington, D. C.: American Council on Education, 1971.

STEPREG1: Stepwise linear regression analysis. Madison: University
of Wisconsin Academic Computing Center, 1975.

Sutcliffe, J. P. A probability model for errors of classification. I.
General considerations. Psychometrika, 1965, 30, 73-96.

Swaminathan, H., Hambleton, R. K., & Algina, J. Reliability of criterion-
referenced tests: A decision-theoretic formulation. Journal of
Educational Measurement, 1974, 11, 263-267.

Wedman, I. Reliability, validity and discrimination measures for
criterion-referenced tests. Educational Reports (Umea University,
Sweden), 1973, Whole No. 4.



APPENDIX A

Supplementary Algebraic Derivations



n-l - n; n-x - x*l - 

A-l Proof that I g - I .. 'if • "TP ♦ — 

x«0 x«0 



n-l n- 1 

x»0 X'l 



where, in the last term, x was replaced by x-1.



.V' . . (i. . "fig • ft|f 



I f - ^ 



T f 






A-2 Relation between the two indices of separation

In Chapter IV, the index of separation of total scores, $S$, is
given (Equation 6) as:



n = the number of items,

N = number of persons, and

$X_p$ = pth person's total score.

In addition, the criterion-referenced index of separation of total scores,
$S_c$, is given (Equation 7) as:

n, N, and X are as above, f is the frequency of score X in the
distribution of scores, and C is the criterion cut-off score. $S$ can be
shown to be a special case of $S_c$. If we start with the formulation of
$S_c$ and substitute $n/2$ for C, we obtain







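A-3 Equivalence of the phi coefficient ($r_\phi^*$) and coefficient kappa ($\kappa^*$)
when off-diagonal cells are equal ($B = C = E$)

\begin{align*}
\kappa^* &= \frac{\dfrac{A+D}{N} - \left[\dfrac{(A+E)(A+E)}{N^2} + \dfrac{(D+E)(D+E)}{N^2}\right]}
                 {1 - \left[\dfrac{(A+E)(A+E)}{N^2} + \dfrac{(D+E)(D+E)}{N^2}\right]} \\[6pt]
         &= \frac{AN + DN - (A+E)^2 - (D+E)^2}{N^2 - (A+E)^2 - (D+E)^2} \\[6pt]
         &= \frac{A(A+D+2E) + D(A+D+2E) - (A+E)^2 - (D+E)^2}{(A+D+2E)^2 - (A+E)^2 - (D+E)^2} \\[6pt]
         &= \frac{A^2 + AD + 2AE + AD + D^2 + 2DE - A^2 - 2AE - E^2 - D^2 - 2DE - E^2}
                 {A^2 + D^2 + 4E^2 + 2AD + 4AE + 4DE - A^2 - 2AE - E^2 - D^2 - 2DE - E^2} \\[6pt]
         &= \frac{2AD - 2E^2}{2AD + 2AE + 2DE + 2E^2} \\[6pt]
         &= \frac{2(AD - E^2)}{2(A+E)(D+E)} \\[6pt]
         &= \frac{AD - E^2}{(A+E)(D+E)} = r_\phi^*
\end{align*}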
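
The identity derived in A-3 can be checked numerically. The sketch below is illustrative only: it uses the standard textbook definitions of the phi coefficient and Cohen's kappa for a fourfold table with cells A, B (first row) and C, D (second row), and the function names are hypothetical.

```python
from math import sqrt


def phi_coefficient(a, b, c, d):
    """Phi coefficient of a fourfold table [[a, b], [c, d]]."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


def cohen_kappa(a, b, c, d):
    """Cohen's kappa of a fourfold table [[a, b], [c, d]]."""
    n = a + b + c + d
    observed = (a + d) / n
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (observed - expected) / (1 - expected)


# With equal off-diagonal cells (B = C = E) both reduce to
# (AD - E^2) / ((A + E)(D + E)).
A, E, D = 10, 3, 5
closed_form = (A * D - E ** 2) / ((A + E) * (D + E))
print(phi_coefficient(A, E, E, D), cohen_kappa(A, E, E, D), closed_form)
```

When B and C differ, the two functions generally return different values, which is why the equivalence is stated only for equal off-diagonal cells.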






APPENDIX B

Graphs of^CX) for oa^;ini Score X, for Selected 
Criterion levels and Number of Items 



iJ(X) for each X o?) a 4-}te?n test for all meaningful 
criterion ievels. v 

^iX) for each X on an 8-item test for three selected 
criterion levels* ^ 

5iCX) for each X on a 16-item test for four selected 
criterion levels* 

i)(X) for each X on a 32-itcp test for sr/en selected 

criterion levels, ; 

«5(X) for each X on a iO-item ti^st for three si::lecced 
criterion levels. 

fi(X) for each X on a test for* fi'y^^ scire tf::d 

criterion levels. ^ 

for each X on a to^itosn test for five <>electcdl 
criterion levels- 



170 



Figure B-1. Graph for each X on a 4-item test for all meaningful criterion levels.

Figure B-2. Graph for each X on an 8-item test for three selected criterion levels.

Figure B-3. Graph for each X on a 16-item test for four selected criterion levels.

Figure B-4. Graph for each X on a 32-item test for seven selected criterion levels.

Figure B-5. Graph for each X on a 10-item test for three selected criterion levels.

Figure B-6. Graph for each X on a 20-item test for five selected criterion levels.

Figure B-7. Graph for each X on a 40-item test for five selected criterion levels.



APPENDIX C

Computer Program Input Parameter Distributions and Subroutines,
with Notes on Calculation of Vector Components

Person Competence

$c = (c_1, c_2, \ldots, c_p, \ldots, c_N)$, where N = number of persons

1. Chi-square. Calculated from



where $\nu$ = a parameter selected to control the shape
(degrees of freedom)

and $c_p = y_p \cdot A$, where A is a scaling factor chosen
so that the maximum value of $c_p$ coincides with
a parameter selected to control the range.

2. Mirror-image quasi-chi-square. This is calculated as above,
with each $c_p$ being replaced by $1 - c_p$.

The calculation of the chi-square vector components is similar to that of
normal distribution vector components (q.v. for a less technical explanation).

The chi-square distribution was included as an option because empirical
data from criterion-referenced tests suggest that post-instruction
total score distributions often approximate the distribution of the
mirror-image chi-square. Further, it seems reasonable to assume that a
population that is not knowledgeable might have pre-instruction total score
distributions resembling a positively skewed chi-square.
3. Normal. This is calculated from

$c_p = y_p \cdot B + \mu$, where B is a scaling factor to make the
components fit within the predetermined
range, which is itself a parameter selected
to control dispersion

and $\mu = E(c_p)$ is a parameter selected to control location.

The vector components are not determined by generating random values,
thereby necessitating truncation to make them fit within a range, but
rather by apportioning the area under the curve according to the
distribution function, and assigning as values the "weighted midpoints" of the
line segments within each of N regions. The operation can be thought
of as having three steps: first, the mean and standard deviation of the
normal distribution are defined; second, the "midpoint" of each segment is
found (in the case of the two extreme chunks, by finding the points beyond
which in each direction 1/2N of the area lies); and third, a linear
transformation is applied so that the two extreme values coincide with the
limits of the predefined range. (Actually, the range, rather than the
standard deviation, is defined, but the computer program merely works
backwards.)

4. Bimodal "inverse normal." First a normal distribution vector is
generated as defined above. Then a transformation is applied and adjusted.
The effect of the transformation and the adjustments is that of cutting
the normal distribution in half at the middle, translating the left half
to the right, and the right half to the left. (See Figure 6 for an example.)

5. Manual. This highly flexible subroutine is included to enable
one to approximate unusual shapes in the competence distribution. Given a
distribution transcribed into graph form, with x and y coordinates of up
to twelve points on the curve such that $0 \le x_1 < x_2 < \cdots \le 1$, one
inputs these ordered pairs as parameters. The subroutine calculates the areas
of the trapezoids under the curve and assigns elements of the competence
vector accordingly.

6. List. With this option, one can specify the vector components
by supplying a list of the component values.

7. Call. Additional distribution subroutines, such as binomial,
can be called into play and used as the need arises. Only options 3 and 4
were used in this study.
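
The non-random construction described under option 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the program's actual subroutine: the area under the standard normal curve is cut into N regions of equal area, each region is represented by the point that leaves half of its area on either side (so the two extreme points leave 1/2N of the total area beyond them), and the result is rescaled linearly to a chosen range. The function name and its parameters are hypothetical.

```python
from statistics import NormalDist


def normal_competence_vector(n_persons, low, high):
    """Deterministic, ascending 'normal' vector of n_persons values.

    Assumption: the "weighted midpoint" of region p is taken to be the
    quantile at cumulative area (p - 0.5) / N of the standard normal curve.
    Requires n_persons >= 2.
    """
    standard = NormalDist()  # mean 0, standard deviation 1
    z = [standard.inv_cdf((p - 0.5) / n_persons)
         for p in range(1, n_persons + 1)]
    span = z[-1] - z[0]
    # Linear transformation: extreme values land on the range limits.
    return [low + (high - low) * (value - z[0]) / span for value in z]


if __name__ == "__main__":
    for component in normal_competence_vector(5, 0.2, 0.9):
        print(round(component, 4))
```

Because the range is fixed first and the standard deviation follows from it, the sketch "works backwards" in the same sense as the parenthetical remark under option 3.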

Item Difficulty

$d = (d_1, d_2, \ldots, d_i, \ldots, d_n)$, where n = number of items

1. "House." This is so named because the region under the curve
looks like a child's drawing of a house--an isosceles triangle atop a
rectangle. Input parameters define the "corners" and "peak" of the "roof."
This distribution includes the degenerate subcases of uniform (rectangular),
triangular, and constant.

Empirical data suggest that the distribution of item difficulties often
approximates some type of "house" distribution. Uniform distributions were
used in this study.

2. This is the same as described for the vector of person

3. This is the same as for






Item Goodness

Distributions and other options for the vector g are the same as for d.
However, since the vector components for d and g are generated in
ascending numerical order, a subroutine is employed which randomly permutes
the vector components by reassigning their subscripts. This is done in
order to avoid interaction between d and g.

In this study, only uniform distributions were used.

Error Terms

All error terms are randomly generated from an internal normal
distribution subroutine, the standard deviation of which can be specified. The
starting point (within the computer's subroutine) for any of the error
terms can be specified, so that identical error components can be generated
on successive trials if this is wanted. This would be desirable, for
example, if one wanted to investigate the effect on reliability indices when
only the item goodness vector is changed.
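
The two mechanisms just described, permuting an ascending vector and restarting the error generator from a specified point, can be illustrated with a short sketch. This is an assumption-laden illustration using Python's standard `random` module, in which a seed plays the role of the "starting point"; the function names are hypothetical and do not come from the original program.

```python
import random


def permuted(values, seed):
    """Return the components of `values` in a random order.

    Reassigning subscripts at random removes the ordering that ascending
    generation would otherwise share between two vectors such as d and g.
    """
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


def error_terms(count, standard_deviation, seed):
    """Normally distributed error terms, reproducible for a given seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, standard_deviation) for _ in range(count)]


if __name__ == "__main__":
    goodness = permuted([0.1, 0.2, 0.3, 0.4], seed=7)
    first_trial = error_terms(3, 1.0, seed=42)
    second_trial = error_terms(3, 1.0, seed=42)
    print(goodness, first_trial == second_trial)
```

Holding the seed fixed while changing only the item goodness vector reproduces the kind of controlled comparison mentioned above.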






APPENDIX D

Summaries of Stepwise Analyses of Regression

D-1  Summary of stepwise analysis of regression, with $\beta$ as
     dependent variable.

D-2  Summary of stepwise analysis of regression, with $k^2$
     as dependent variable.

D-3  Summary of stepwise analysis of regression, with $\mu_c^2$
     as dependent variable.

D-4  Summary of stepwise analysis of regression, with $S_c$
     as dependent variable.



Paraiieier 
Sens) 



Coefficient 
of 

Determination 



unimodal 
(1,2,3,5.7) 

biinodal 

all 
il-S) 



.91 

.94 
.14 

•'I 

• ■ 'f 

.93 
.80 

.65 



,0.' 



.20 



.81 



Coefficients of regression equation 



Constant 



.16 

-.08 

-.06 
11 , 

-..05 
.5.1 



.85 



''3 



KR-21 



(b) 
4] -1.4 



4) -.57 



5) -16,4 



Criterion 
Level 



5) -.098 4) -.36 2) -.14 

2) -.19^ 
S] -.14 4) -.099 2) -.16 



■51 .11. 

1) .11 

3) .58 

2] 1.7 

I] -.05 

2] .15 



3) .032 



(a) 



18 



I) M 88 

1] .64 99 

1) .90 71 

2) 1.56 

( 

1) .29 83 

3) 5.4 31 
1) .88 92 
3) .55 10 



S 



(1) 



2] .40 
2) .95 

2) .05 

3) .044/ 
2) 1.6 

1) 6.1 29 

I] 

1) 3 s: 



1), .90 83 



1) .91 72 



3) .20 
1) .28 15 

3] .17 



(a) percent of variance in $\beta$ accounted for by this variable, if > 10%
(b) number set off to the left indicates variable's order of entry into regression equation

Table D-1. Summary of stepwise analysis of regression, with $\beta$ as dependent variable.



Para^ter 
Set{s) 



3 
4 
S 
6 



URioodal 
(1,2,3,5,7) 

biiBodal 



T,W,V 

all 

(1-8), 



Coefficient 
of 

Dfitcminacion 



.87 
'.81 

,■.93 
.95 
■ .83 

.86 

i, 



Coefficients of tejression equation 



Constant 



■ ".04 

• .31 
.40 

'y 

• .99 

,74 
■.99 

.17 

.97' 

■.41 



6 (a) 



(b) ' 
1) 1.0 

1) .70 90. 

I) .84 71 

M. 

I) ' ,24 92 



1) .&! ' 83 



}) .55 72 



I (a) 



3)-.19 

1) ..044 69 
l)-.27 92 

3) ..24 

1) ..018 83 

4) -.10 

1) -.072. 25 
4) -.23, 



1) ..012 

2) M 



3) .17 

2) .037 30 \ 3) .044 
6) >05S . 3) .24 



1) -.047 93 

2) .14 



CQSpl , 

T"~" 



2) '.034 

2) ..24 11 

> '»• ,.* 

3] ,019 



2,-S) (d) 



2;-S) (d) 




(a) percent of variance in $k^2$ accounted for by this variable, if > 10%

(b) number set off to the left indicates variable's order of entry into regression equation
is the 'only, un'ioodal distribution «hcre / rathfir tl)fl' B -is the first entering variable; however, the 



c 

tion between and 8. for this distribution is -.90 . 
tias the second variable to enter, but left at the 'fifth step 






Table D-2. Summary of stepwise analysis of regression, with $k^2$ as dependent variable.




(a) percent of variance in $\mu_c^2$ accounted for by this variable, if > 10%

(b) number set off to the left indicates variable's order of entry into regression equation

Table D-3. Summary of stepwise analysis of regression, with $\mu_c^2$ as dependent variable.



■ » 

Parameter 
Set($) 



I 
2 
3 

4'. 

S ; 



: (1,2,3,5,7) 
' blKoda) 

air 

(1.8) 



Coefficient/ 



/ ,75 ■■ 
' .25 



.78 
,80 
.90 
.92 



.63 
,56 
.57 



Coefficients- of regression cqmion 



Oamm. 



-.14 
.,08 
-1,3 

!,77 

ii2|S 

..43. 

.,22 

•3,0 



-1,6 



.11 . 



4.3 



b' (a) 



1) ,38 63 

2) .33 . 
1) 2,2 25 

1) U3 65 

1) ,90 29 

1) .45. 87 

1)1.0 52 



,2. 



3) 1.9 



4) ,8S 



2)U 



4) 1.8 



5) 2.6 



(c) 



2)10.4 26 



(c) ' 68 



2) $ 
5). 



88 39, 



2j'.75 



2) .67 



3) ,82 



3) .43 



2) 4,04 n ' 



3) ..82 .23 



1) -.38 ,18 



(c) 



icospi W 



2) -,043 10 

3) .,044 



5)'s077- 
4) ,041 ^ 
2) -,016 
4) MAI 



8 



(a) percent of variance in $S_c$ accounted for by this variable, if > 10%

(b) number set off to the left indicates variable's order of entry into regression equation

(c) this was the first variable to enter, but left at the fourth step

(d) no variables entered into this equation, and thus there is no coefficient of determination

Table D-4. Summary of stepwise analysis of regression, with $S_c$ as dependent variable.



APPENDIX E

A Binomial Model for Stepping Up Coefficient Beta

It was noted in Chapter VII that since $\beta$ is equal to the mean
proportion of agreement on all possible split halves of a test, it
can be considered to be a half-test index, and that it should
somehow be stepped up in order to represent the operational reliability
of a whole test. The formula presented in Chapter VI was based on
purely empirical evidence, and thus is unsatisfying mathematically.

One mathematical approach to the solution to this problem is
to use the binomial probability model. Briefly, the method is to
calculate $\beta$ from an estimated frequency distribution of total
scores for a double-length (2n items) test, based on the obtained
frequency distribution from the test of n items, and
utilizing the binomial probability model to estimate likelihoods
concerning each person.

More specifically, suppose person p receives a score of x
on an n-item test. Under the binomial model, $x/n$ is the best estimate
of the proportion of items in the universe that he would answer
correctly, and hence also the best estimate of the proportion of
items he would answer correctly on a test of 2n items. Let $Y_p$ be
the examinee's score on this test; $Y_p \in \{0, 1, \ldots, 2n\}$. Then the
probability that person p receives a score of y is

$$\Pr(Y_p = y \mid X_p = x) = \binom{2n}{y} \left(\frac{x}{n}\right)^{y} \left(1 - \frac{x}{n}\right)^{2n-y}$$



(Note- here that 



But there are $f_x$ persons with score x, and hence the
contribution to $F_y$ from all those with this score is

$$f_x \binom{2n}{y} \left(\frac{x}{n}\right)^{y} \left(1 - \frac{x}{n}\right)^{2n-y}$$

However, a number of different scores x will contribute to the
frequency of y. Thus, summing over all scores x, the frequency
of score y in the distribution is

$$F_y = \sum_{x=0}^{n} f_x \binom{2n}{y} \left(\frac{x}{n}\right)^{y} \left(1 - \frac{x}{n}\right)^{2n-y}$$

We have thus arrived at a method of calculating the expected
frequencies of each component of the vector $F = (F_0, F_1, \ldots, F_{2n})$,
the expected frequencies of scores on the
double-length test. We can now compute $\beta$ on the double-length
test:




C-1 2C-2 n+C-I 2n 

jr»0 ^ y«C - ^ y*2C ' ' y«n+C 



where (as before)

N = number of examinees;

n = number of items (on the single-length test);

y = a score on the hypothetical test of 2n items;

C = the cutoff score on the n-item test, and hence the smallest
integer $\geq cn$;

and

$$F_y = \sum_{x=0}^{n} f_x \binom{2n}{y} \left(\frac{x}{n}\right)^{y} \left(1 - \frac{x}{n}\right)^{2n-y},$$

with $f_x$ the obtained frequency of score x on the n-item test.

> . _ ■ ■■ . ' 

Note that

1. $F_y$ is generally not an integer;

2. when x = 0 or x = n the quantity $0^0$ appears in the formulation
of $F_y$, and must be defined as equal to 1;

3. the second term in the brackets vanishes when C = 1; the
third term vanishes when C = n;

4. the adjustment for odd n is no longer necessary;

5. an analogous formula holds for the stepped-up coefficient
for trichotomous data.
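
The procedure of this appendix can be sketched in a few lines of Python. The first function follows the binomial formula for $F_y$ given above. The second and third are assumptions rather than a transcription of the report's formula: they treat the agreement probability for a person with score y on the 2n-item test as the hypergeometric chance that two random halves of n items each fall on the same side of the cutoff C, which is one natural reading of "mean proportion of agreement on all possible split halves." All function names are hypothetical.

```python
from math import comb


def double_length_frequencies(f, n):
    """Expected frequencies F[y], y = 0..2n, from obtained f[x], x = 0..n."""
    frequencies = []
    for y in range(2 * n + 1):
        total = 0.0
        for x, fx in enumerate(f):
            p = x / n
            # Python evaluates 0.0 ** 0 as 1.0, matching note 2 above.
            total += fx * comb(2 * n, y) * p ** y * (1 - p) ** (2 * n - y)
        frequencies.append(total)
    return frequencies


def agreement_probability(y, n, cutoff):
    """Chance that both random halves of a 2n-item test agree (assumed model)."""
    favourable = 0
    for h in range(max(0, y - n), min(n, y) + 1):  # score on the first half
        if (h >= cutoff) == (y - h >= cutoff):
            favourable += comb(y, h) * comb(2 * n - y, n - h)
    return favourable / comb(2 * n, n)


def stepped_up_beta(f, n, cutoff):
    """Coefficient of agreement on the estimated double-length distribution."""
    frequencies = double_length_frequencies(f, n)
    n_persons = sum(f)
    return sum(F_y * agreement_probability(y, n, cutoff)
               for y, F_y in enumerate(frequencies)) / n_persons
```

Under this reading, scores below C or at least n + C always agree, and a score of exactly 2C - 1 never does, which matches the summation limits that survive in the formula above and the vanishing terms described in note 3.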






